OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Alex Elder	572c588eda	libceph: move init of bio_iter If a message has a non-null bio pointer, its bio_iter field is initialized in write_partial_msg_pages() if this has not been done already. This is really a one-time setup operation for sending a message's (bio) data, so move that initialization code into prepare_write_message_data() which serves that purpose. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:16 -07:00
Alex Elder	df6ad1f973	libceph: move init_bio_*() functions up Move init_bio_iter() and iter_bio_next() up in their source file so the'll be defined before they're needed. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:15 -07:00
Alex Elder	fd154f3c75	libceph: don't mark footer complete before it is This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE flag before the CRC for the data portion (recorded in the footer) has been completely computed. Hold off setting the complete flag until we've decided it's ready to send. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:13 -07:00
Alex Elder	84ca8fc87f	libceph: encapsulate advancing msg page In write_partial_msg_pages(), once all the data from a page has been sent we advance to the next one. Put the code that takes care of this into its own function. While modifying write_partial_msg_pages(), make its local variable "in_trail" be Boolean, and use the local variable "msg" (which is just the connection's current out_msg pointer) consistently. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:12 -07:00
Alex Elder	739c905baa	libceph: encapsulate out message data setup Move the code that prepares to write the data portion of a message into its own function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-05 21:14:10 -07:00
Sage Weil	d59315ca8c	libceph: drop ceph_con_get/put helpers and nref member These are no longer used. Every ceph_connection instance is embedded in another structure, and refcounts manipulated via the get/put ops. Signed-off-by: Sage Weil <sage@inktank.com>	2012-06-22 08:13:45 -05:00
Sage Weil	36eb71aa57	libceph: use con get/put methods The ceph_con_get/put() helpers manipulate the embedded con ref count, which isn't used now that ceph_connections are embedded in other structures. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-06-22 07:30:27 -05:00
Yan, Zheng	b132cf4c73	rbd: Clear ceph_msg->bio_iter for retransmitted message The bug can cause NULL pointer dereference in write_partial_msg_pages Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com> (cherry picked from commit `43643528cc`)	2012-06-20 07:43:50 -05:00
Dan Carpenter	26ce171915	libceph: fix NULL dereference in reset_connection() We dereference "con->in_msg" on the line after it was set to NULL. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-06-19 08:52:33 -05:00
Sage Weil	9a64e8e0ac	Linux 3.5-rc1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4= =CcWn -----END PGP SIGNATURE----- Merge tag 'v3.5-rc1' Linux 3.5-rc1 Conflicts: net/ceph/messenger.c	2012-06-15 12:32:04 -07:00
Sage Weil	89a86be0ce	libceph: transition socket state prior to actual connect Once we call ->connect(), we are racing against the actual connection, and a subsequent transition from CONNECTING -> CONNECTED. Set the state to CONNECTING before that, under the protection of the mutex, to avoid the race. This was introduced in `928443cd96`, with the original socket state code. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-06-15 11:26:37 -07:00
Yan, Zheng	43643528cc	rbd: Clear ceph_msg->bio_iter for retransmitted message The bug can cause NULL pointer dereference in write_partial_msg_pages Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-06-07 08:27:33 -05:00
Alex Elder	8921d114f5	libceph: make ceph_con_revoke_message() a msg op ceph_con_revoke_message() is passed both a message and a ceph connection. A ceph_msg allocated for incoming messages on a connection always has a pointer to that connection, so there's no need to provide the connection when revoking such a message. Note that the existing logic does not preclude the message supplied being a null/bogus message pointer. The only user of this interface is the OSD client, and the only value an osd client passes is a request's r_reply field. That is always non-null (except briefly in an error path in ceph_osdc_alloc_request(), and that drops the only reference so the request won't ever have a reply to revoke). So we can safely assume the passed-in message is non-null, but add a BUG_ON() to make it very obvious we are imposing this restriction. Rename the function ceph_msg_revoke_incoming() to reflect that it is really an operation on an incoming message. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:55 -05:00
Alex Elder	6740a845b2	libceph: make ceph_con_revoke() a msg operation ceph_con_revoke() is passed both a message and a ceph connection. Now that any message associated with a connection holds a pointer to that connection, there's no need to provide the connection when revoking a message. This has the added benefit of precluding the possibility of the providing the wrong connection pointer. If the message's connection pointer is null, it is not being tracked by any connection, so revoking it is a no-op. This is supported as a convenience for upper layers, so they can revoke a message that is not actually "in flight." Rename the function ceph_msg_revoke() to reflect that it is really an operation on a message, not a connection. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Alex Elder	92ce034b5a	libceph: have messages take a connection reference There are essentially two types of ceph messages: incoming and outgoing. Outgoing messages are always allocated via ceph_msg_new(), and at the time of their allocation they are not associated with any particular connection. Incoming messages are always allocated via ceph_con_in_msg_alloc(), and they are initially associated with the connection from which incoming data will be placed into the message. When an outgoing message gets sent, it becomes associated with a connection and remains that way until the message is successfully sent. The association of an incoming message goes away at the point it is sent to an upper layer via a con->ops->dispatch method. This patch implements reference counting for all ceph messages, such that every message holds a reference (and a pointer) to a connection if and only if it is associated with that connection (as described above). For background, here is an explanation of the ceph message lifecycle, emphasizing when an association exists between a message and a connection. Outgoing Messages An outgoing message is "owned" by its allocator, from the time it is allocated in ceph_msg_new() up to the point it gets queued for sending in ceph_con_send(). Prior to that point the message's msg->con pointer is null; at the point it is queued for sending its message pointer is assigned to refer to the connection. At that time the message is inserted into a connection's out_queue list. When a message on the out_queue list has been sent to the socket layer to be put on the wire, it is transferred out of that list and into the connection's out_sent list. At that point it is still owned by the connection, and will remain so until an acknowledgement is received from the recipient that indicates the message was successfully transferred. When such an acknowledgement is received (in process_ack()), the message is removed from its list (in ceph_msg_remove()), at which point it is no longer associated with the connection. So basically, any time a message is on one of a connection's lists, it is associated with that connection. Reference counting outgoing messages can thus be done at the points a message is added to the out_queue (in ceph_con_send()) and the point it is removed from either its two lists (in ceph_msg_remove())--at which point its connection pointer becomes null. Incoming Messages When an incoming message on a connection is getting read (in read_partial_message()) and there is no message in con->in_msg, a new one is allocated using ceph_con_in_msg_alloc(). At that point the message is associated with the connection. Once that message has been completely and successfully read, it is passed to upper layer code using the connection's con->ops->dispatch method. At that point the association between the message and the connection no longer exists. Reference counting of connections for incoming messages can be done by taking a reference to the connection when the message gets allocated, and releasing that reference when it gets handed off using the dispatch method. We should never fail to get a connection reference for a message--the since the caller should already hold one. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Alex Elder	38941f8031	libceph: have messages point to their connection When a ceph message is queued for sending it is placed on a list of pending messages (ceph_connection->out_queue). When they are actually sent over the wire, they are moved from that list to another (ceph_connection->out_sent). When acknowledgement for the message is received, it is removed from the sent messages list. During that entire time the message is "in the possession" of a single ceph connection. Keep track of that connection in the message. This will be used in the next patch (and is a helpful bit of information for debugging anyway). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Alex Elder	1c20f2d267	libceph: tweak ceph_alloc_msg() The function ceph_alloc_msg() is only used to allocate a message that will be assigned to a connection's in_msg pointer. Rename the function so this implied usage is more clear. In addition, make that assignment inside the function (again, since that's precisely what it's intended to be used for). This allows us to return what is now provided via the passed-in address of a "skip" variable. The return type is now Boolean to be explicit that there are only two possible outcomes. Make sure the result of an ->alloc_msg method call always sets the value of *skip properly. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Alex Elder	1bfd89f4e6	libceph: fully initialize connection in con_init() Move the initialization of a ceph connection's private pointer, operations vector pointer, and peer name information into ceph_con_init(). Rearrange the arguments so the connection pointer is first. Hide the byte-swapping of the peer entity number inside ceph_con_init() Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-06 09:23:54 -05:00
Alex Elder	a5988c490e	libceph: set CLOSED state bit in con_init Once a connection is fully initialized, it is really in a CLOSED state, so make that explicit by setting the bit in its state field. It is possible for a connection in NEGOTIATING state to get a failure, leading to ceph_fault() and ultimately ceph_con_close(). Clear that bits if it is set in that case, to reflect that the connection truly is closed and is no longer participating in a connect sequence. Issue a warning if ceph_con_open() is called on a connection that is not in CLOSED state. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-01 08:37:56 -05:00
Alex Elder	ce2c8903e7	libceph: start tracking connection socket state Start explicitly keeping track of the state of a ceph connection's socket, separate from the state of the connection itself. Create placeholder functions to encapsulate the state transitions. -------- \| NEW* \| transient initial state -------- \| con_sock_state_init() v ---------- \| CLOSED \| initialized, but no socket (and no ---------- TCP connection) ^ \ \| \ con_sock_state_connecting() \| ---------------------- \| \ + con_sock_state_closed() \ \|\ \ \| \ \ \| ----------- \ \| \| CLOSING \| socket event; \ \| ----------- await close \ \| ^ \| \| \| \| \| + con_sock_state_closing() \| \| / \ \| \| / --------------- \| \| / \ v \| / -------------- \| / -----------------\| CONNECTING \| socket created, TCP \| \| / -------------- connect initiated \| \| \| con_sock_state_connected() \| \| v ------------- \| CONNECTED \| TCP connection established ------------- Make the socket state an atomic variable, reinforcing that it's a distinct transtion with no possible "intermediate/both" states. This is almost certainly overkill at this point, though the transitions into CONNECTED and CLOSING state do get called via socket callback (the rest of the transitions occur with the connection mutex held). We can back out the atomicity later. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>	2012-06-01 08:37:56 -05:00
Alex Elder	928443cd96	libceph: start separating connection flags from state A ceph_connection holds a mixture of connection state (as in "state machine" state) and connection flags in a single "state" field. To make the distinction more clear, define a new "flags" field and use it rather than the "state" field to hold Boolean flag values. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com>	2012-06-01 08:37:56 -05:00
Alex Elder	15d9882c33	libceph: embed ceph messenger structure in ceph_client A ceph client has a pointer to a ceph messenger structure in it. There is always exactly one ceph messenger for a ceph client, so there is no need to allocate it separate from the ceph client structure. Switch the ceph_client structure to embed its ceph_messenger structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-01 08:37:56 -05:00
Alex Elder	e22004235a	libceph: rename kvec_reset and kvec_add functions The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add() are entirely private functions, so drop the "ceph_" prefix in their name to make them slightly more wieldy. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-01 08:37:55 -05:00
Alex Elder	327800bdc2	libceph: rename socket callbacks Change the names of the three socket callback functions to make it more obvious they're specifically associated with a connection's socket (not the ceph connection that uses it). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-06-01 08:37:55 -05:00
Alex Elder	6384bb8b8e	libceph: kill bad_proto ceph connection op No code sets a bad_proto method in its ceph connection operations vector, so just get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-06-01 08:37:55 -05:00
Alex Elder	e5e372da9a	libceph: eliminate connection state "DEAD" The ceph connection state "DEAD" is never set and is therefore not needed. Eliminate it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-06-01 08:37:55 -05:00
Linus Torvalds	af56e0aa35	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "There are some updates and cleanups to the CRUSH placement code, a bug fix with incremental maps, several cleanups and fixes from Josh Durgin in the RBD block device code, a series of cleanups and bug fixes from Alex Elder in the messenger code, and some miscellaneous bounds checking and gfp cleanups/fixes." Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the networking people preferring "unsigned int" over just "unsigned". * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits) libceph: fix pg_temp updates libceph: avoid unregistering osd request when not registered ceph: add auth buf in prepare_write_connect() ceph: rename prepare_connect_authorizer() ceph: return pointer from prepare_connect_authorizer() ceph: use info returned by get_authorizer ceph: have get_authorizer methods return pointers ceph: ensure auth ops are defined before use ceph: messenger: reduce args to create_authorizer ceph: define ceph_auth_handshake type ceph: messenger: check return from get_authorizer ceph: messenger: rework prepare_connect_authorizer() ceph: messenger: check prepare_write_connect() result ceph: don't set WRITE_PENDING too early ceph: drop msgr argument from prepare_write_connect() ceph: messenger: send banner in process_connect() ceph: messenger: reset connection kvec caller libceph: don't reset kvec in prepare_write_banner() ceph: ignore preferred_osd field ceph: fully initialize new layout ...	2012-05-30 11:17:19 -07:00
Alex Elder	3da54776e2	ceph: add auth buf in prepare_write_connect() Move the addition of the authorizer buffer to a connection's out_kvec out of get_connect_authorizer() and into its caller. This way, the caller--prepare_write_connect()--can avoid adding the connect header to out_kvec before it has been fully initialized. Prior to this patch, it was possible for a connect header to be sent over the wire before the authorizer protocol or buffer length fields were initialized. An authorizer buffer associated with that header could also be queued to send only after the connection header that describes it was on the wire. Fixes http://tracker.newdream.net/issues/2424 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-18 17:35:59 -07:00
Alex Elder	dac1e716c6	ceph: rename prepare_connect_authorizer() Change the name of prepare_connect_authorizer(). The next patch is going to make this function no longer add anything to the connection's out_kvec, so it will no longer fit the pattern of the rest of the prepare_connect_*() functions. In addition, pass the address of a variable that will hold the authorization protocol to use. Move the assignment of that to the connection's out_connect structure into prepare_write_connect(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	729796be91	ceph: return pointer from prepare_connect_authorizer() Change prepare_connect_authorizer() so it returns a pointer (or pointer-coded error). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	8f43fb5389	ceph: use info returned by get_authorizer Rather than passing a bunch of arguments to be filled in with the content of the ceph_auth_handshake buffer now returned by the get_authorizer method, just use the returned information in the caller, and drop the unnecessary arguments. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	a3530df33e	ceph: have get_authorizer methods return pointers Have the get_authorizer auth_client method return a ceph_auth pointer rather than an integer, pointer-encoding any returned error value. This is to pave the way for making use of the returned value in an upcoming patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:13 -05:00
Alex Elder	ed96af6460	ceph: messenger: check return from get_authorizer In prepare_connect_authorizer(), a connection's get_authorizer method is called but ignores its return value. This function can return an error, so check for it and return it if that ever occurs. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	b1c6b9803f	ceph: messenger: rework prepare_connect_authorizer() Change prepare_connect_authorizer() so it returns without dropping the connection mutex if the connection has no get_authorizer method. Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning authorization protocols. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	5a0f8fdd8a	ceph: messenger: check prepare_write_connect() result prepare_write_connect() can return an error, but only one of its callers checks for it. All the rest are in functions that already return errors, so it should be fine to return the error if one gets returned. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	e10c758e40	ceph: don't set WRITE_PENDING too early prepare_write_connect() prepares a connect message, then sets WRITE_PENDING on the connection. Then after this, it calls prepare_connect_authorizer(), which updates the content of the connection buffer already queued for sending. It's also possible it will result in prepare_write_connect() returning -EAGAIN despite the WRITE_PENDING big getting set. Fix this by preparing the connect authorizer first, setting the WRITE_PENDING bit only after that is done. Partially addresses http://tracker.newdream.net/issues/2424 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	e825a66df9	ceph: drop msgr argument from prepare_write_connect() In all cases, the value passed as the msgr argument to prepare_write_connect() is just con->msgr. Just get the msgr value from the ceph connection and drop the unneeded argument. The only msgr passed to prepare_write_banner() is also therefore just the one from con->msgr, so change that function to drop the msgr argument as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	41b90c0085	ceph: messenger: send banner in process_connect() prepare_write_connect() has an argument indicating whether a banner should be sent out before sending out a connection message. It's only ever set in one of its callers, so move the code that arranges to send the banner into that caller and drop the "include_banner" argument from prepare_write_connect(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	84fb3adf64	ceph: messenger: reset connection kvec caller Reset a connection's kvec fields in the caller rather than in prepare_write_connect(). This ends up repeating a few lines of code but it's improving the separation between distinct operations on the connection, which we can take advantage of later. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	d329156f16	libceph: don't reset kvec in prepare_write_banner() Move the kvec reset for a connection out of prepare_write_banner and into its only caller. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-17 08:18:12 -05:00
Alex Elder	fd51653f78	ceph: messenger: change read_partial() to take "end" arg Make the second argument to read_partial() be the ending input byte position rather than the beginning offset it now represents. This amounts to moving the addition "to + size" into the caller. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-14 12:16:42 -05:00
Alex Elder	e6cee71fac	ceph: messenger: update "to" in read_partial() caller read_partial() always increases whatever "to" value is supplied by adding the requested size to it, and that's the only thing it does with that pointed-to value. Do that pointer advance in the caller (and then only when the updated value will be subsequently used), and change the "to" parameter to be an in-only and non-pointer value. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-14 12:16:42 -05:00
Alex Elder	57dac9d162	ceph: messenger: use read_partial() in read_partial_message() There are two blocks of code in read_partial_message()--those that read the header and footer of the message--that can be replaced by a call to read_partial(). Do that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-05-14 12:16:41 -05:00
Eric Dumazet	95c9617472	net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-15 12:44:40 -04:00
Alex Elder	8d63e318c4	libceph: isolate kmap() call in write_partial_msg_pages() In write_partial_msg_pages(), every case now does an identical call to kmap(page). Instead, just call it once inside the CRC-computing block where it's needed. Move the definition of kaddr inside that block, and make it a (char *) to ensure portable pointer arithmetic. We still don't kunmap() it until after the sendpage() call, in case that also ends up needing to use the mapping. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:52 -05:00
Alex Elder	9bd1966344	libceph: rename "page_shift" variable to something sensible In write_partial_msg_pages() there is a local variable used to track the starting offset within a bio segment to use. Its name, "page_shift" defies the Linux convention of using that name for log-base-2(page size). Since it's only used in the bio case rename it "bio_offset". Use it along with the page_pos field to compute the memory offset when computing CRC's in that function. This makes the bio case match the others more closely. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:52 -05:00
Alex Elder	0cdf9e6018	libceph: get rid of zero_page_address There's not a lot of benefit to zero_page_address, which basically holds a mapping of the zero page through the life of the messenger module. Even with our own mapping, the sendpage interface where it's used may need to kmap() it again. It's almost certain to be in low memory anyway. So stop treating the zero page specially in write_partial_msg_pages() and just get rid of zero_page_address entirely. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:52 -05:00
Alex Elder	e36b13cceb	libceph: only call kernel_sendpage() via helper Make ceph_tcp_sendpage() be the only place kernel_sendpage() is used, by using this helper in write_partial_msg_pages(). Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:52 -05:00
Alex Elder	31739139f3	libceph: use kernel_sendpage() for sending zeroes If a message queued for send gets revoked, zeroes are sent over the wire instead of any unsent data. This is done by constructing a message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg(). Since we are already working with a page in this case we can use the sendpage interface instead. Create a new ceph_tcp_sendpage() helper that sets up flags to match the way ceph_tcp_sendmsg() does now. Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	37675b0f42	libceph: fix inverted crc option logic CRC's are computed for all messages between ceph entities. The CRC computation for the data portion of message can optionally be disabled using the "nocrc" (common) ceph option. The default is for CRC computation for the data portion to be enabled. Unfortunately, the code that implements this feature interprets the feature flag wrong, meaning that by default the CRC's have not been computed (or checked) for the data portion of messages unless the "nocrc" option was supplied. Fix this, in write_partial_msg_pages() and read_partial_message(). Also change the flag variable in write_partial_msg_pages() to be "no_datacrc" to match the usage elsewhere in the file. This fixes http://tracker.newdream.net/issues/2064 Signed-off-by: Alex Elder <elder@dreamhost.com> Reviewed-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	84495f4961	libceph: some simple changes Nothing too big here. - define the size of the buffer used for consuming ignored incoming data using a symbolic constant - simplify the condition determining whether to unmap the page in write_partial_msg_pages(): do it for crc but not if the page is the zero page Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	f42299e6c3	libceph: small refactor in write_partial_kvec() Make a small change in the code that counts down kvecs consumed by a ceph_tcp_sendmsg() call. Same functionality, just blocked out a little differently. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	fe3ad593e2	libceph: do crc calculations outside loop Move blocks of code out of loops in read_partial_message_section() and read_partial_message(). They were only was getting called at the end of the last iteration of the loop anyway. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	a9a0c51af4	libceph: separate CRC calculation from byte swapping Calculate CRC in a separate step from rearranging the byte order of the result, to improve clarity and readability. Use offsetof() to determine the number of bytes to include in the CRC calculation. In read_partial_message(), switch which value gets byte-swapped, since the just-computed CRC is already likely to be in a register. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	bca064d236	libceph: use "do" in CRC-related Boolean variables Change the name (and type) of a few CRC-related Boolean local variables so they contain the word "do", to distingish their purpose from variables used for holding an actual CRC value. Note that in the process of doing this I identified a fairly serious logic error in write_partial_msg_pages(): the value of "do_crc" assigned appears to be the opposite of what it should be. No attempt to fix this is made here; this change preserves the erroneous behavior. The problem I found is documented here: http://tracker.newdream.net/issues/2064 Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	d3002b974c	libceph: a few small changes This gathers a number of very minor changes: - use %hu when formatting the a socket address's address family - null out the ceph_msgr_wq pointer after the queue has been destroyed - drop a needless cast in ceph_write_space() - add a WARN() call in ceph_state_change() in the event an unrecognized socket state is encountered - rearrange the logic in ceph_con_get() and ceph_con_put() so that: - the reference counts are only atomically read once - the values displayed via dout() calls are known to be meaningful at the time they are formatted Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	41617d0c9c	libceph: make ceph_tcp_connect() return int There is no real need for ceph_tcp_connect() to return the socket pointer it creates, since it already assigns it to con->sock, which is visible to the caller. Instead, have it return an error code, which tidies things up a bit. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:51 -05:00
Alex Elder	6173d1f02f	libceph: encapsulate some messenger cleanup code Define a helper function to perform various cleanup operations. Use it both in the exit routine and in the init routine in the event of an error. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:50 -05:00
Alex Elder	e0f43c9419	libceph: make ceph_msgr_wq private The messenger workqueue has no need to be public. So give it static scope. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:50 -05:00
Alex Elder	859eb79948	libceph: encapsulate connection kvec operations Encapsulate the operation of adding a new chunk of data to the next open slot in a ceph_connection's out_kvec array. Also add a "reset" operation to make subsequent add operations start at the beginning of the array again. Use these routines throughout, avoiding duplicate code and ensuring all calls are handled consistently. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:50 -05:00
Alex Elder	963be4d770	libceph: move prepare_write_banner() One of the arguments to prepare_write_connect() indicates whether it is being called immediately after a call to prepare_write_banner(). Move the prepare_write_banner() call inside prepare_write_connect(), and reinterpret (and rename) the "after_banner" argument so it indicates that prepare_write_connect() should make the call rather than should know it has already been made. This was split out from the next patch to highlight this change in logic. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:50 -05:00
Alex Elder	99f0f3b2c4	ceph: eliminate some abusive casts This fixes some spots where a type cast to (void *) was used as as a universal type hiding mechanism. Instead, properly cast the type to the intended target type. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Alex Elder	bd40614512	ceph: eliminate some needless casts This eliminates type casts in some places where they are not required. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Alex Elder	f64a93172b	ceph: kill addr_str_lock spinlock; use atomic instead A spinlock is used to protect a value used for selecting an array index for a string used for formatting a socket address for human consumption. The index is reset to 0 if it ever reaches the maximum index value. Instead, use an ever-increasing atomic variable as a sequence number, and compute the array index by masking off all but the sequence number's lowest bits. Make the number of entries in the array a power of two to allow the use of such a mask (to avoid jumps in the index value when the sequence number wraps). The length of these strings is somewhat arbitrarily set at 60 bytes. The worst-case length of a string produced is 54 bytes, for an IPv6 address that can't be shortened, e.g.: [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767 Change it so we arbitrarily use 64 bytes instead; if nothing else it will make the array of these line up better in hex dumps. Rename a few things to reinforce the distinction between the number of strings in the array and the length of individual strings. Signed-off-by: Alex Elder <elder@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Alex Elder	a5bc3129a2	ceph: make use of "else" where appropriate Rearrange ceph_tcp_connect() a bit, making use of "else" rather than re-testing a value with consecutive "if" statements. Don't record a connection's socket pointer unless the connect operation is successful. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Alex Elder	5766651971	ceph: use a shared zero page rather than one per messenger Each messenger allocates a page to be used when writing zeroes out in the event of error or other abnormal condition. Instead, use the kernel ZERO_PAGE() for that purpose. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Jim Schutt	182fac2689	net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer The Ceph messenger would sometimes queue multiple work items to write data to a socket when the socket buffer was full. Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the same way that net/core/stream.c:sk_stream_write_space() does, i.e., clearing it only when sufficient space is available in the socket buffer. Signed-off-by: Jim Schutt <jaschut@sandia.gov> Reviewed-by: Alex Elder <elder@dreamhost.com>	2012-03-22 10:47:45 -05:00
Paul Gortmaker	bc3b2d7fb9	net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules These files are non modular, but need to export symbols using the macros now living in export.h -- call out the include so that things won't break when we remove the implicit presence of module.h from everywhere. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:30 -04:00
Noah Watkins	ee3b56f265	ceph: use kernel DNS resolver Change ceph_parse_ips to take either names given as IP addresses or standard hostnames (e.g. localhost). The DNS lookup is done using the dns_resolver facility similar to its use in AFS, NFS, and CIFS. This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER that controls if this feature is on or off. Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Sage Weil	f0ed1b7cef	libceph: warn on msg allocation failures Any non-masked msg allocation failure should generate a warning and stack trace to the console. All of these need to eventually be replaced by safe preallocation or msgpools. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Sage Weil	b61c27636f	libceph: don't complain on msgpool alloc failures The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Jim Schutt	c0d5f9db1c	libceph: initialize ack_stamp to avoid unnecessary connection reset Commit `4cf9d54463` recorded when an outgoing ceph message was ACKed, in order to avoid unnecessary connection resets when an OSD is busy. However, ack_stamp is uninitialized, so there is a window between when the message is sent and when it is ACKed in which handle_timeout() interprets the unitialized value as an expired timeout, and resets the connection unnecessarily. Close the window by initializing ack_stamp. Signed-off-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Sage Weil <sage@newdream.net>	2011-09-16 09:16:22 -07:00
Sage Weil	4cf9d54463	libceph: don't time out osd requests that haven't been received Keep track of when an outgoing message is ACKed (i.e., the server fully received it and, presumably, queued it for processing). Time out OSD requests only if it's been too long since they've been received. This prevents timeouts and connection thrashing when the OSDs are simply busy and are throttling the requests they read off the network. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:27:24 -07:00
Sage Weil	a2a79609c0	libceph: add missing breaks in addr_set_port Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:05 -07:00
Sage Weil	0417788226	libceph: fix TAG_WAIT case If we get a WAIT as a client something went wrong; error out. And don't fall through to an unrelated case. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:04 -07:00
Sage Weil	12a2f643b0	libceph: use snprintf for unknown addrs Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:03 -07:00
Sage Weil	e8f54ce169	libceph: fix uninitialized value when no get_authorizer method is set If there is no get_authorizer method we set the out_kvec to a bogus pointer. The length is also zero in that case, so it doesn't much matter, but it's better not to add the empty item in the first place. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:25:02 -07:00
Sage Weil	0da5d70369	libceph: handle connection reopen race with callbacks If a connection is closed and/or reopened (ceph_con_close, ceph_con_open) it can race with a callback. con_work does various state checks for closed or reopened sockets at the beginning, but drops con->mutex before making callbacks. We need to check for state bit changes after retaking the lock to ensure we restart con_work and execute those CLOSED/OPENING tests or else we may end up operating under stale assumptions. In Jim's case, this was causing 'bad tag' errors. There are four cases where we re-take the con->mutex inside con_work: catch them all and return EAGAIN from try_{read,write} so that we can restart con_work. Reported-by: Jim Schutt <jaschut@sandia.gov> Tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-19 11:21:05 -07:00
Henry C Chang	ca20892db7	libceph: fix ceph_msg_new error path If memory allocation failed, calling ceph_msg_put() will cause GPF since some of ceph_msg variables are not initialized first. Fix Bug #970. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-03 09:28:11 -07:00
Sage Weil	e00de341fd	libceph: fix msgr standby handling The standby logic used to be pretty dependent on the work requeueing behavior that changed when we switched to WQ_NON_REENTRANT. It was also very fragile. Restructure things so that: - We clear WRITE_PENDING when we set STANDBY. This ensures we will requeue work when we wake up later. - con_work backs off if STANDBY is set. There is nothing to do if we are in standby. - clear_standby() helper is called by both con_send() and con_keepalive(), the two actions that can wake us up again. Move the connect_seq++ logic here. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-04 12:25:05 -08:00
Sage Weil	e76661d0a5	libceph: fix msgr keepalive flag There was some broken keepalive code using a dead variable. Shift to using the proper bit flag. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-04 12:24:31 -08:00
Sage Weil	60bf8bf881	libceph: fix msgr backoff With commit f363e45f we replaced a bunch of hacky workqueue mutual exclusion logic with the WQ_NON_REENTRANT flag. One pieces of fallout is that the exponential backoff breaks in certain cases: * con_work attempts to connect. * we get an immediate failure, and the socket state change handler queues immediate work. * con_work calls con_fault, we decide to back off, but can't queue delayed work. In this case, we add a BACKOFF bit to make con_work reschedule delayed work next time it runs (which should be immediately). Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-04 12:24:28 -08:00
Sage Weil	692d20f576	libceph: retry after authorization failure If we mark the connection CLOSED we will give up trying to reconnect to this server instance. That is appropriate for things like a protocol version mismatch that won't change until the server is restarted, at which point we'll get a new addr and reconnect. An authorization failure like this is probably due to the server not properly rotating it's secret keys, however, and should be treated as transient so that the normal backoff and retry behavior kicks in. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-03 13:47:40 -08:00
Sage Weil	42961d2333	libceph: fix socket write error handling Pass errors from writing to the socket up the stack. If we get -EAGAIN, return 0 from the helper to simplify the callers' checks. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-25 08:19:34 -08:00
Sage Weil	98bdb0aa00	libceph: fix socket read error handling If we get EAGAIN when trying to read from the socket, it is not an error. Return 0 from the helper in this case to simplify the error handling cases in the caller (indirectly, try_read). Fix try_read to pass any error to it's caller (con_work) instead of almost always returning 0. This let's us respond to things like socket disconnects. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-25 08:17:48 -08:00
Tejun Heo	f363e45fd1	net/ceph: make ceph_msgr_wq non-reentrant ceph messenger code does a rather complex dancing around multithread workqueue to make sure the same work item isn't executed concurrently on different CPUs. This restriction can be provided by workqueue with WQ_NON_REENTRANT. Make ceph_msgr_wq non-reentrant workqueue with the default concurrency level and remove the QUEUED/BUSY logic. * This removes backoff handling in con_work() but it couldn't reliably block execution of con_work() to begin with - queue_con() can be called after the work started but before BUSY is set. It seems that it was an optimization for a rather cold path and can be safely removed. * The number of concurrent work items is bound by the number of connections and connetions are independent from each other. With the default concurrency level, different connections will be executed independently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sage Weil <sage@newdream.net> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:14 -08:00
Sage Weil	d96c9043d1	ceph: fix msgr_init error path create_workqueue() returns NULL on failure. Signed-off-by: Sage Weil <sage@newdream.net>	2010-12-13 20:30:28 -08:00
Sage Weil	c5c6b19d4b	ceph: explicitly specify page alignment in network messages The alignment used for reading data into or out of pages used to be taken from the data_off field in the message header. This only worked as long as the page alignment matched the object offset, breaking direct io to non-page aligned offsets. Instead, explicitly specify the page alignment next to the page vector in the ceph_msg struct, and use that instead of the message header (which probably shouldn't be trusted). The alloc_msg callback is responsible for filling in this field properly when it sets up the page vector. Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-09 12:43:17 -08:00
Sage Weil	df9f86faf3	ceph: fix small seq message skipping If the client gets out of sync with the server message sequence number, we normally skip low seq messages (ones we already received). The skip code was also incrementing the expected seq, such that all subsequent messages also appeared old and got skipped, and an eventual timeout on the osd connection. This resulted in some lagging requests and console messages like [233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017 [233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018 [233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019 [233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020 [233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021 [233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022 [233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023 [233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024 [233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025 [233500.534614] ceph: tid 6023602 timed out on osd22, will reset osd Reported-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-01 15:49:23 -07:00
Yehuda Sadeh	3d14c5d2b6	ceph: factor out libceph from Ceph file system This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:37:28 -07:00

1 2 3 4 5

240 Commits