OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Alex Elder	49719778bf	libceph: support raw data requests Allow osd request ops that aren't otherwise structured (not class, extent, or watch ops) to specify "raw" data to be used to hold incoming data for the op. Make use of this capability for the osd STAT op. Prefix the name of the private function osd_req_op_init() with "_", and expose a new function by that (earlier) name whose purpose is to initialize osd ops with (only) implied data. For now we'll just support the use of a page array for an osd op with incoming raw data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:00 -07:00
Alex Elder	863c7eb590	libceph: clean up osd data field access functions There are a bunch of functions defined to encapsulate getting the address of a data field for a particular op in an osd request. They're all defined the same way, so create a macro to take the place of all of them. Two of these are used outside the osd client code, so preserve them (but convert them to use the new macro internally). Stop exporting the ones that aren't used elsewhere. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:59 -07:00
Alex Elder	406e2c9f92	libceph: kill off osd data write_request parameters In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:58 -07:00
Alex Elder	26be88087a	libceph: change how "safe" callback is used An osd request currently has two callbacks. They inform the initiator of the request when we've received confirmation for the target osd that a request was received, and when the osd indicates all changes described by the request are durable. The only time the second callback is used is in the ceph file system for a synchronous write. There's a race that makes some handling of this case unsafe. This patch addresses this problem. The error handling for this callback is also kind of gross, and this patch changes that as well. In ceph_sync_write(), if a safe callback is requested we want to add the request on the ceph inode's unsafe items list. Because items on this list must have their tid set (by ceph_osd_start_request()), the request added after the call to that function returns. The problem with this is that there's a race between starting the request and adding it to the unsafe items list; the request may already be complete before ceph_sync_write() even begins to put it on the list. To address this, we change the way the "safe" callback is used. Rather than just calling it when the request is "safe", we use it to notify the initiator the bounds (start and end) of the period during which the request is unsafe. So the initiator gets notified just before the request gets sent to the osd (when it is "unsafe"), and again when it's known the results are durable (it's no longer unsafe). The first call will get made in __send_request(), just before the request message gets sent to the messenger for the first time. That function is only called by __send_queued(), which is always called with the osd client's request mutex held. We then have this callback function insert the request on the ceph inode's unsafe list when we're told the request is unsafe. This will avoid the race because this call will be made under protection of the osd client's request mutex. It also nicely groups the setup and cleanup of the state associated with managing unsafe requests. The name of the "safe" callback field is changed to "unsafe" to better reflect its new purpose. It has a Boolean "unsafe" parameter to indicate whether the request is becoming unsafe or is now safe. Because the "msg" parameter wasn't used, we drop that. This resolves the original problem reportedin: http://tracker.ceph.com/issues/4706 Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:18:52 -07:00
Alex Elder	04017e29bb	libceph: make method call data be a separate data item Right now the data for a method call is specified via a pointer and length, and it's copied--along with the class and method name--into a pagelist data item to be sent to the osd. Instead, encode the data in a data item separate from the class and method names. This will allow large amounts of data to be supplied to methods without copying. Only rbd uses the class functionality right now, and when it really needs this it will probably need to use a page array rather than a page list. But this simple implementation demonstrates the functionality on the osd client, and that's enough for now. This resolves: http://tracker.ceph.com/issues/4104 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:35 -07:00
Alex Elder	90af36022a	libceph: add, don't set data for a message Change the names of the functions that put data on a pagelist to reflect that we're adding to whatever's already there rather than just setting it to the one thing. Currently only one data item is ever added to a message, but that's about to change. This resolves: http://tracker.ceph.com/issues/2770 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:34 -07:00
Alex Elder	ca8b3a6917	libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:33 -07:00
Alex Elder	5240d9f95d	libceph: replace message data pointer with list In place of the message data pointer, use a list head which links through message data items. For now we only support a single entry on that list. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:32 -07:00
Alex Elder	8ae4f4f5c0	libceph: have cursor point to data Rather than having a ceph message data item point to the cursor it's associated with, have the cursor point to a data item. This will allow a message cursor to be used for more than one data item. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:30 -07:00
Alex Elder	36153ec9dd	libceph: move cursor into message A message will only be processing a single data item at a time, so there's no need for each data item to have its own cursor. Move the cursor embedded in the message data structure into the message itself. To minimize the impact, keep the data->cursor field, but make it be a pointer to the cursor in the message. Move the definition of ceph_msg_data above ceph_msg_data_cursor so the cursor can point to the data without a forward definition rather than vice-versa. This and the upcoming patches are part of: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:29 -07:00
Alex Elder	c851c49591	libceph: record bio length The bio is the only data item type that doesn't record its full length. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:28 -07:00
Alex Elder	f759ebb968	libceph: skip message if too big to receive We know the length of our message buffers. If we get a message that's too long, just dump it and ignore it. If skip was set then con->in_msg won't be valid, so be careful not to dereference a null pointer in the process. This resolves: http://tracker.ceph.com/issues/4664 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:27 -07:00
Alex Elder	ea96571f7b	libceph: fix possible CONFIG_BLOCK build problem This patch: 15a0d7b libceph: record message data length did not enclose some bio-specific code inside CONFIG_BLOCK as it should have. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:26 -07:00
Alex Elder	5476492fba	libceph: kill off osd request r_data_in and r_data_out Finally! Convert the osd op data pointers into real structures, and make the switch over to using them instead of having all ops share the in and/or out data structures in the osd request. Set up a new function to traverse the set of ops and release any data associated with them (pages). This and the patches leading up to it resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:25 -07:00
Alex Elder	ec9123c567	libceph: set the data pointers when encoding ops Still using the osd request r_data_in and r_data_out pointer, but we're basically only referring to it via the data pointers in the osd ops. And we're transferring that information to the request or reply message only when the op indicates it's needed, in osd_req_encode_op(). To avoid a forward reference, ceph_osdc_msg_data_set() was moved up in the file. Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(), because the ops array will already be zeroed anyway. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:24 -07:00
Alex Elder	a4ce40a9a7	libceph: combine initializing and setting osd data This ends up being a rather large patch but what it's doing is somewhat straightforward. Basically, this is replacing two calls with one. The first of the two calls is initializing a struct ceph_osd_data with data (either a page array, a page list, or a bio list); the second is setting an osd request op so it associates that data with one of the op's parameters. In place of those two will be a single function that initializes the op directly. That means we sort of fan out a set of the needed functions: - extent ops with pages data - extent ops with pagelist data - extent ops with bio list data and - class ops with page data for receiving a response We also have define another one, but it's only used internally: - class ops with pagelist data for request parameters Note that we still haven't gotten rid of the osd request's r_data_in and r_data_out fields. All the osd ops refer to them for their data. For now, these data fields are pointers assigned to the appropriate r_data_* field when these new functions are called. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:23 -07:00
Alex Elder	39b44cbe86	libceph: set message data when building osd request All calls of ceph_osdc_start_request() are preceded (in the case of rbd, almost) immediately by a call to ceph_osdc_build_request(). Move the build calls at the top of ceph_osdc_start_request() out of there and into the ceph_osdc_build_request(). Nothing prevents moving these calls to the top of ceph_osdc_build_request(), either (and we're going to want them there in the next patch) so put them at the top. This and the next patch are related to: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:22 -07:00
Alex Elder	e65550fd94	libceph: move ceph_osdc_build_request() This simply moves ceph_osdc_build_request() later in its source file without any change. Done as a separate patch to facilitate review of the change in the next patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:21 -07:00
Alex Elder	5f562df5f5	libceph: format class info at init time An object class method is formatted using a pagelist which contains the class name, the method name, and the data concatenated into an osd request's outbound data. Currently when a class op is initialized in osd_req_op_cls_init(), the lengths of and pointers to these three items are recorded. Later, when the op is getting formatted into the request message, a new pagelist is created and that is when these items get copied into the pagelist. This patch makes it so the pagelist to hold these items is created when the op is initialized instead. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:19 -07:00
Alex Elder	c99d2d4abb	libceph: specify osd op by index in request An osd request now holds all of its source op structures, and every place that initializes one of these is in fact initializing one of the entries in the the osd request's array. So rather than supplying the address of the op to initialize, have caller specify the osd request and an indication of which op it would like to initialize. This better hides the details the op structure (and faciltates moving the data pointers they use). Since osd_req_op_init() is a common routine, and it's not used outside the osd client code, give it static scope. Also make it return the address of the specified op (so all the other init routines don't have to repeat that code). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:15 -07:00
Alex Elder	8c042b0df9	libceph: add data pointers in osd op structures An extent type osd operation currently implies that there will be corresponding data supplied in the data portion of the request (for write) or response (for read) message. Similarly, an osd class method operation implies a data item will be supplied to receive the response data from the operation. Add a ceph_osd_data pointer to each of those structures, and assign it to point to eithre the incoming or the outgoing data structure in the osd message. The data is not always available when an op is initially set up, so add two new functions to allow setting them after the op has been initialized. Begin to make use of the data item pointer available in the osd operation rather than the request data in or out structure in places where it's convenient. Add some assertions to verify pointers are always set the way they're expected to be. This is a sort of stepping stone toward really moving the data into the osd request ops, to allow for some validation before making that jump. This is the first in a series of patches that resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:14 -07:00
Alex Elder	54d5064912	libceph: rename data out field in osd request op There are fields "indata" and "indata_len" defined the ceph osd request op structure. The "in" part is with from the point of view of the osd server, but is a little confusing here on the client side. Change their names to use "request" instead of "in" to indicate that it defines data provided with the request (as opposed the data returned in the response). Rename the local variable in osd_req_encode_op() to match. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:13 -07:00
Alex Elder	79528734f3	libceph: keep source rather than message osd op array An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:12 -07:00
Alex Elder	23c08a9cb2	libceph: define ceph_osd_data_length() One more osd data helper, which returns the length of the data item, regardless of its type. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:09 -07:00
Alex Elder	c54d47bfad	libceph: define a few more helpers Define ceph_osd_data_init() and ceph_osd_data_release() to clean up a little code. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:08 -07:00
Alex Elder	43bfe5de9f	libceph: define osd data initialization helpers Define and use functions that encapsulate the initializion of a ceph_osd_data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:06 -07:00
Alex Elder	9fc6e06471	libceph: compute incoming bytes once This is a simple change, extracting the number of incoming data bytes just once in handle_reply(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:05 -07:00
Alex Elder	98fa5dd883	libceph: provide data length when preparing message In prepare_message_data(), the length used to initialize the cursor is taken from the header of the message provided. I'm working toward not using the header data length field to determine length in outbound messages, and this is a step in that direction. For inbound messages this will be set to be the actual number of bytes that are arriving (which may be less than the total size of the data buffer available). This resolves: http://tracker.ceph.com/issues/4589 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:03 -07:00
Alex Elder	e5975c7c8e	ceph: build osd request message later for writepages Hold off building the osd request message in ceph_writepages_start() until just before it will be submitted to the osd client for execution. We'll still create the request and allocate the page pointer array after we learn we have at least one page to write. A local variable will be used to keep track of the allocated array of pages. Wait until just before submitting the request for assigning that page array pointer to the request message. Create ands use a new function osd_req_op_extent_update() whose purpose is to serve this one spot where the length value supplied when an osd request's op was initially formatted might need to get changed (reduced, never increased) before submitting the request. Previously, ceph_writepages_start() assigned the message header's data length because of this update. That's no longer necessary, because ceph_osdc_build_request() will recalculate the right value to use based on the content of the ops in the request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:02 -07:00
Alex Elder	02ee07d300	libceph: hold off building osd request Defer building the osd request until just before submitting it in all callers except ceph_writepages_start(). (That caller will be handed in the next patch.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:01 -07:00
Alex Elder	acead002b2	libceph: don't build request in ceph_osdc_new_request() This patch moves the call to ceph_osdc_build_request() out of ceph_osdc_new_request() and into its caller. This is in order to defer formatting osd operation information into the request message until just before request is started. The only unusual (ab)user of ceph_osdc_build_request() is ceph_writepages_start(), where the final length of write request may change (downward) based on the current inode size or the oldest snapshot context with dirty data for the inode. The remaining callers don't change anything in the request after has been built. This means the ops array is now supplied by the caller. It also means there is no need to pass the mtime to ceph_osdc_new_request() (it gets provided to ceph_osdc_build_request()). And rather than passing a do_sync flag, have the number of ops in the ops array supplied imply adding a second STARTSYNC operation after the READ or WRITE requested. This and some of the patches that follow are related to having the messenger (only) be responsible for filling the content of the message header, as described here: http://tracker.ceph.com/issues/4589 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:58 -07:00
Alex Elder	a193080481	libceph: record message data length Keep track of the length of the data portion for a message in a separate field in the ceph_msg structure. This information has been maintained in wire byte order in the message header, but that's going to change soon. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:57 -07:00
Alex Elder	ace6d3a96f	libceph: drop ceph_osd_request->r_con_filling_msg A field in an osd request keeps track of whether a connection is currently filling the request's reply message. This patch gets rid of that field. An osd request includes two messages--a request and a reply--and they're both associated with the connection that existed to its the target osd at the time the request was created. An osd request can be dropped early, even when it's in flight. And at that time both messages are released. It's possible the reply message has been supplied to its connection to receive an incoming response message at the time the osd request gets dropped. So ceph_osdc_release_request() revokes that message from the connection before releasing it so things get cleaned up properly. Previously this may have caused a problem, because the connection that a message was associated with might have gone away before the revoke request. And to avoid any problems using that connection, the osd client held a reference to it when it supplies its response message. However since this commit: `38941f80` libceph: have messages point to their connection all messages hold a reference to the connection they are associated with whenever the connection is actively operating on the message (i.e. while the message is queued to send or sending, and when it data is being received into it). And if a message has no connection associated with it, ceph_msg_revoke_incoming() won't do anything when asked to revoke it. As a result, there is no need to keep an additional reference to the connection associated with a message when we hand the message to the messenger when it calls our alloc_msg() method to receive something. If the connection were operating on it, it would have its own reference, and if not, there's no work to be done when we need to revoke it. So get rid of the osd request's r_con_filling_msg field. This resolves: http://tracker.ceph.com/issues/4647 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:54 -07:00
Alex Elder	ef4859d647	libceph: define ceph_decode_pgid() only once There are two basically identical definitions of __decode_pgid() in libceph, one in "net/ceph/osdmap.c" and the other in "net/ceph/osd_client.c". Get rid of both, and instead define a single inline version in "include/linux/ceph/osdmap.h". Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:52 -07:00
Alex Elder	8058fd4503	libceph: drop mutex on error in handle_reply() The osd client mutex is acquired just before getting a reference to a request in handle_reply(). However the error paths after that don't drop the mutex before returning as they should. Drop the mutex after dropping the request reference. Also add a bad_mutex label at that point and use it so the failed request lookup case can be handled with the rest. This resolves: http://tracker.ceph.com/issues/4615 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:51 -07:00
Alex Elder	b0270324c5	libceph: use osd_req_op_extent_init() Use osd_req_op_extent_init() in ceph_osdc_new_request() to initialize the one or two ops built in that function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:49 -07:00
Alex Elder	d18d1e2807	libceph: clean up ceph_osd_new_request() All callers of ceph_osd_new_request() pass either CEPH_OSD_OP_READ or CEPH_OSD_OP_WRITE as the opcode value. The function assumes it by filling in the extent fields in the ops array it builds. So just assert that is the case, and don't bother calling op_has_extent() before filling in the first osd operation in the array. Define some local variables to gather the information to fill into the first op, and then fill in the op array all in one place. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:48 -07:00
Alex Elder	a19dadfba9	libceph: don't update op in calc_layout() The ceph_osdc_new_request() an array of osd operations is built up and filled in partially within that function and partially in the called function calc_layout(). Move the latter part back out to ceph_osdc_new_request() so it's all done in one place. This makes it unnecessary to pass the op pointer to calc_layout(), so get rid of that parameter. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:47 -07:00
Alex Elder	75d1c941e5	libceph: pass offset and length out of calc_layout() The purpose of calc_layout() is to determine, given a file offset and length and a layout describing the placement of file data across objects, where in "object space" that data resides. Specifically, it determines which object should hold the first part of the specified range of file data, and the offset and length of data within that object. The length will not exceed the bounds of the object, and the caller is informed of that maximum length. Add two parameters to calc_layout() to allow the object-relative offset and length to be passed back to the caller. This is the first steps toward having ceph_osdc_new_request() build its osd op structure using osd_req_op_extent_init(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:46 -07:00
Alex Elder	33803f3300	libceph: define source request op functions The rbd code has a function that allocates and populates a ceph_osd_req_op structure (the in-core version of an osd request operation). When reviewed, Josh suggested two things: that the big varargs function might be better split into type-specific functions; and that this functionality really belongs in the osd client rather than rbd. This patch implements both of Josh's suggestions. It breaks up the rbd function into separate functions and defines them in the osd client module as exported interfaces. Unlike the rbd version, however, the functions don't allocate an osd_req_op structure; they are provided the address of one and that is initialized instead. The rbd function has been eliminated and calls to it have been replaced by calls to the new routines. The rbd code now now use a stack (struct) variable to hold the op rather than allocating and freeing it each time. For now only the capabilities used by rbd are implemented. Implementing all the other osd op types, and making the rest of the code use it will be done separately, in the next few patches. Note that only the extent, cls, and watch portions of the ceph_osd_req_op structure are currently used. Delete the others (xattr, pgls, and snap) from its definition so nobody thinks it's actually implemented or needed. We can add it back again later if needed, when we know it's been tested. This (and a few follow-on patches) resolves: http://tracker.ceph.com/issues/3861 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:45 -07:00
Alex Elder	a8dd0a37bc	libceph: define osd_req_opcode_valid() Define a separate function to determine the validity of an opcode, and use it inside osd_req_encode_op() in order to unclutter that function. Don't update the destination op at all--and return zero--if an unsupported or unrecognized opcode is seen in osd_req_encode_op(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:44 -07:00
Alex Elder	0baa1bd9b6	libceph: be explicit in masking bottom 16 bits In ceph_osdc_build_request() there is a call to cpu_to_le16() which provides a 64-bit value as its argument. Because of the implied byte swapping going on it looked pretty suspect to me. At the moment it turns out the behavior is well defined, but masking off those bottom bits explicitly eliminates this distraction, and is in fact more directly related to the purpose of the message header's data_off field. This resolves: http://tracker.ceph.com/issues/4125 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:41 -07:00
Alex Elder	56fc565916	libceph: account for alignment in pages cursor When a cursor for a page array data message is initialized it needs to determine the initial value for cursor->last_piece. Currently it just checks if length is less than a page, but that's not correct. The data in the first page in the array will be offset by a page offset based on the alignment recorded for the data. (All pages thereafter will be aligned at the base of the page, so there's no need to account for this except for the first page.) Because this was wrong, there was a case where the length of a piece would be calculated as all of the residual bytes in the message and that plus the page offset could exceed the length of a page. So fix this case. Make sure the sum won't wrap. This resolves a third issue described in: http://tracker.ceph.com/issues/4598 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:40 -07:00
Alex Elder	5df521b1ee	libceph: page offset must be less than page size Currently ceph_msg_data_pages_advance() allows the page offset value to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will treat it as 0. But that doesn't happen, and the result led to a helpful assertion failure. Change ceph_msg_data_pages_advance() to truncate the offset to 0 before returning if it reaches PAGE_SIZE. Make a few other minor adjustments in this area (comments and a better assertion) while modifying it. This resolves a second issue described in: http://tracker.ceph.com/issues/4598 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:39 -07:00
Alex Elder	1190bf06a6	libceph: fix broken data length assertions It's OK for the result of a read to come back with fewer bytes than were requested. So don't trigger a BUG() in that case when initializing the data cursor. This resolves the first problem described in: http://tracker.ceph.com/issues/4598 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:38 -07:00
Alex Elder	6644ed7b7e	libceph: make message data be a pointer Begin the transition from a single message data item to a list of them by replacing the "data" structure in a message with a pointer to a ceph_msg_data structure. A null pointer will indicate the message has no data; replace the use of ceph_msg_has_data() with a simple check for a null pointer. Create functions ceph_msg_data_create() and ceph_msg_data_destroy() to dynamically allocate and free a data item structure of a given type. When a message has its data item "set," allocate one of these to hold the data description, and free it when the last reference to the message is dropped. This partially resolves: http://tracker.ceph.com/issues/4429 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:37 -07:00
Alex Elder	8ea299bcbc	libceph: use only ceph_msg_data_advance() The *_msg_pos_next() functions do little more than call ceph_msg_data_advance(). Replace those wrapper functions with a simple call to ceph_msg_data_advance(). This cleanup is related to: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:36 -07:00
Alex Elder	143334ff44	libceph: don't add to crc unless data sent In write_partial_message_data() we aggregate the crc for the data portion of the message as each new piece of the data item is encountered. Because it was computed before sending the data, if an attempt to send a new piece resulted in 0 bytes being sent, the crc crc across that piece would erroneously get computed again and added to the aggregate result. This would occasionally happen in the evnet of a connection failure. The crc value isn't really needed until the complete value is known after sending all data, so there's no need to compute it before sending. So don't calculate the crc for a piece until after we know at least one byte of it has been sent. That will avoid this problem. This resolves: http://tracker.ceph.com/issues/4450 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:35 -07:00
Alex Elder	f5db90bcf2	libceph: kill last of ceph_msg_pos The only remaining field in the ceph_msg_pos structure is did_page_crc. In the new cursor model of things that flag (or something like it) belongs in the cursor. Define a new field "need_crc" in the cursor (which applies to all types of data) and initialize it to true whenever a cursor is initialized. In write_partial_message_data(), the data CRC still will be computed as before, but it will check the cursor->need_crc field to determine whether it's needed. Any time the cursor is advanced to a new piece of a data item, need_crc will be set, and this will cause the crc for that entire piece to be accumulated into the data crc. In write_partial_message_data() the intermediate crc value is now held in a local variable so it doesn't have to be byte-swapped so many times. In read_partial_msg_data() we do something similar (but mainly for consistency there). With that, the ceph_msg_pos structure can go away, and it no longer needs to be passed as an argument to prepare_message_data(). This cleanup is related to: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:34 -07:00
Alex Elder	859a35d552	libceph: kill most of ceph_msg_pos All but one of the fields in the ceph_msg_pos structure are now never used (only assigned), so get rid of them. This allows several small blocks of code to go away. This is cleanup of old code related to: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:33 -07:00
Alex Elder	643c68a4a9	libceph: use cursor resid for loop condition Use the "resid" field of a cursor rather than finding when the message data position has moved up to meet the data length to determine when all data has been sent or received in write_partial_message_data() and read_partial_msg_data(). This is cleanup of old code related to: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:32 -07:00
Alex Elder	4c59b4a278	libceph: collapse all data items into one It turns out that only one of the data item types is ever used at any one time in a single message (currently). - A page array is used by the osd client (on behalf of the file system) and by rbd. Only one osd op (and therefore at most one data item) is ever used at a time by rbd. And the only time the file system sends two, the second op contains no data. - A bio is only used by the rbd client (and again, only one data item per message) - A page list is used by the file system and by rbd for outgoing data, but only one op (and one data item) at a time. We can therefore collapse all three of our data item fields into a single field "data", and depend on the messenger code to properly handle it based on its type. This allows us to eliminate quite a bit of duplicated code. This is related to: http://tracker.ceph.com/issues/4429 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:30 -07:00
Alex Elder	686be20875	libceph: get rid of read helpers Now that read_partial_message_pages() and read_partial_message_bio() are literally identical functions we can factor them out. They're pretty simple as well, so just move their relevant content into read_partial_msg_data(). This is and previous patches together resolve: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:29 -07:00
Alex Elder	61fcdc97c0	libceph: no outbound zero data There is handling in write_partial_message_data() for the case where only the length of--and no other information about--the data to be sent has been specified. It uses the zero page as the source of data to send in this case. This case doesn't occur. All message senders set up a page array, pagelist, or bio describing the data to be sent. So eliminate the block of code that handles this (but check and issue a warning for now, just in case it happens for some reason). This resolves: http://tracker.ceph.com/issues/4426 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:28 -07:00
Alex Elder	878efabd32	libceph: use cursor for inbound data pages The cursor code for a page array selects the right page, page offset, and length to use for a ceph_tcp_recvpage() call, so we can use it to replace a block in read_partial_message_pages(). This partially resolves: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:27 -07:00
Alex Elder	6518be47f9	libceph: kill ceph message bio_iter, bio_seg The bio_iter and bio_seg fields in a message are no longer used, we use the cursor instead. So get rid of them and the functions that operate on them them. This is related to: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:26 -07:00
Alex Elder	463207aa40	libceph: use cursor for bio reads Replace the use of the information in con->in_msg_pos for incoming bio data. The old in_msg_pos and the new cursor mechanism do basically the same thing, just slightly differently. The main functional difference is that in_msg_pos keeps track of the length of the complete bio list, and assumed it was fully consumed when that many bytes had been transferred. The cursor does not assume a length, it simply consumes all bytes in the bio list. Because the only user of bio data is the rbd client, and because the length of a bio list provided by rbd client always matches the number of bytes in the list, both ways of tracking length are equivalent. In addition, for in_msg_pos the initial bio vector is selected as the initial value of the bio->bi_idx, while the cursor assumes this is zero. Again, the rbd client always passes 0 as the initial index so the effect is the same. Other than that, they basically match: in_msg_pos cursor ---------- ------ bio_iter bio bio_seg vec_index page_pos page_offset The in_msg_pos field is initialized by a call to init_bio_iter(). The bio cursor is initialized by ceph_msg_data_cursor_init(). Both now happen in the same spot, in prepare_message_data(). The in_msg_pos field is advanced by a call to in_msg_pos_next(), which updates page_pos and calls iter_bio_next() to move to the next bio vector, or to the next bio in the list. The cursor is advanced by ceph_msg_data_advance(). That isn't currently happening so add a call to that in in_msg_pos_next(). Finally, the next piece of data to use for a read is determined by a bunch of lines in read_partial_message_bio(). Those can be replaced by an equivalent ceph_msg_data_bio_next() call. This partially resolves: http://tracker.ceph.com/issues/4428 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:25 -07:00
Alex Elder	25aff7c559	libceph: record residual bytes for all message data types All of the data types can use this, not just the page array. Until now, only the bio type doesn't have it available, and only the initiator of the request (the rbd client) is able to supply the length of the full request without re-scanning the bio list. Change the cursor init routines so the length is supplied based on the message header "data_len" field, and use that length to intiialize the "resid" field of the cursor. In addition, change the way "last_piece" is defined so it is based on the residual number of bytes in the original request. This is necessary (at least for bio messages) because it is possible for a read request to succeed without consuming all of the space available in the data buffer. This resolves: http://tracker.ceph.com/issues/4427 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:24 -07:00
Alex Elder	28a89ddece	libceph: drop pages parameter The value passed for "pages" in read_partial_message_pages() is always the pages pointer from the incoming message, which can be derived inside that function. So just get rid of the parameter. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:23 -07:00
Alex Elder	888334f966	libceph: initialize data fields on last msg put When the last reference to a ceph message is dropped, ceph_msg_last_put() is called to clean things up. For "normal" messages (allocated via ceph_msg_new() rather than being allocated from a memory pool) it's sufficient to just release resources. But for a mempool-allocated message we actually have to re-initialize the data fields in the message back to initial state so they're ready to go in the event the message gets reused. Some of this was already done; this fleshes it out so it's done more completely. This resolves: http://tracker.ceph.com/issues/4540 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:22 -07:00
Alex Elder	7e2766a113	libceph: send queued requests when starting new one An osd expects the transaction ids of arriving request messages from a given client to a given osd to increase monotonically. So the osd client needs to send its requests in ascending tid order. The transaction id for a request is set at the time it is registered, in __register_request(). This is also where the request gets placed at the end of the osd client's unsent messages list. At the end of ceph_osdc_start_request(), the request message for a newly-mapped osd request is supplied to the messenger to be sent (via __send_request()). If any other messages were present in the osd client's unsent list at that point they would be sent after this new request message. Because those unsent messages have already been registered, their tids would be lower than the newly-mapped request message, and sending that message first can violate the tid ordering rule. Rather than sending the new request only, send all queued requests (including the new one) at that point in ceph_osdc_start_request(). This ensures the tid ordering property is preserved. With this in place, all messages should now be sent in tid order regardless of whether they're being sent for the first time or re-sent as a result of a call to osd_reset(). This resolves: http://tracker.ceph.com/issues/4392 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:21 -07:00
Alex Elder	ad885927de	libceph: keep request lists in tid order In __map_request(), when adding a request to an osd client's unsent list, add it to the tail rather than the head. That way the newest entries (with the highest tid value) will be last. Maintain an osd's request list in order of increasing tid also. Finally--to be consistent--maintain an osd client's "notarget" list in that order as well. This partially resolves: http://tracker.ceph.com/issues/4392 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:19 -07:00
Alex Elder	e02493c07c	libceph: requeue only sent requests when kicking The osd expects incoming requests for a given object from a given client to arrive in order, with the tid for each request being greater than the tid for requests that have already arrived. This patch fixes two places the osd client might not maintain that ordering. For the osd client, the connection fault method is osd_reset(). That function calls __reset_osd() to close and re-open the connection, then calls __kick_osd_requests() to cause all outstanding requests for the affected osd to be re-sent after the connection has been re-established. When an osd is reset, any in-flight messages will need to be re-sent. An osd client maintains distinct lists for unsent and in-flight messages. Meanwhile, an osd maintains a single list of all its requests (both sent and un-sent). (Each message is linked into two lists--one for the osd client and one list for the osd.) To process an osd "kick" operation, the request list for the osd is traversed, and each request is moved off whichever osd client list it was on (unsent or sent) and placed onto the osd client's unsent list. (It remains where it is on the osd's request list.) When that is done, osd_reset() calls __send_queued() to cause each of the osd client's unsent messages to be sent. OK, with that background... As the osd request list is traversed each request is prepended to the osd client's unsent list in the order they're seen. The effect of this is to reverse the order of these requests as they are put (back) onto the unsent list. Instead, build up a list of only the requests for an osd that have already been sent (by checking their r_sent flag values). Once an unsent request is found, stop examining requests and prepend the requests that need re-sending to the osd client's unsent list. Preserve the original order of requests in the process (previously re-queued requests were reversed in this process). Because they have already been sent, they will have lower tids than any request already present on the unsent list. Just below that, traverse the linger list in forward order as before, but add them to the tail of the list rather than the head. These requests get re-registered, and in the process are give a new (higher) tid, so the should go at the end. This partially resolves: http://tracker.ceph.com/issues/4392 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:18 -07:00
Alex Elder	92451b4910	libceph: no more kick_requests() race Since we no longer drop the request mutex between registering and mapping an osd request in ceph_osdc_start_request(), there is no chance of a race with kick_requests(). We can now therefore map and send the new request unconditionally (but we'll issue a warning should it ever occur). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:17 -07:00
Alex Elder	dc4b870c97	libceph: slightly defer registering osd request One of the first things ceph_osdc_start_request() does is register the request. It then acquires the osd client's map semaphore and request mutex and proceeds to map and send the request. There is no reason the request has to be registered before acquiring the map semaphore. So hold off doing so until after the map semaphore is held. Since register_request() is nothing more than a wrapper around __register_request(), call the latter function instead, after acquiring the request mutex. That leaves register_request() unused, so get rid of it. This partially resolves: http://tracker.ceph.com/issues/4392 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-05-01 21:17:16 -07:00
Sage Weil	e9966076cd	libceph: wrap auth methods in a mutex The auth code is called from a variety of contexts, include the mon_client (protected by the monc's mutex) and the messenger callbacks (currently protected by nothing). Avoid chaos by protecting all auth state with a mutex. Nothing is blocking, so this should be simple and lightweight. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:15 -07:00
Sage Weil	27859f9773	libceph: wrap auth ops in wrapper functions Use wrapper functions that check whether the auth op exists so that callers do not need a bunch of conditional checks. Simplifies the external interface. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:14 -07:00
Sage Weil	0bed9b5c52	libceph: add update_authorizer auth method Currently the messenger calls out to a get_authorizer con op, which will create a new authorizer if it doesn't yet have one. In the meantime, when we rotate our service keys, the authorizer doesn't get updated. Eventually it will be rejected by the server on a new connection attempt and get invalidated, and we will then rebuild a new authorizer, but this is not ideal. Instead, if we do have an authorizer, call a new update_authorizer op that will verify that the current authorizer is using the latest secret. If it is not, we will build a new one that does. This avoids the transient failure. This fixes one of the sorry sequence of events for bug http://tracker.ceph.com/issues/4282 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:13 -07:00
Sage Weil	4b8e8b5d78	libceph: fix authorizer invalidation We were invalidating the authorizer by removing the ticket handler entirely. This was effective in inducing us to request a new authorizer, but in the meantime it mean that any authorizer we generated would get a new and initialized handler with secret_id=0, which would always be rejected by the server side with a confusing error message: auth: could not find secret_id=0 cephx: verify_authorizer could not get service secret for service osd secret_id=0 Instead, simply clear the validity field. This will still induce the auth code to request a new secret, but will let us continue to use the old ticket in the meantime. The messenger code will probably continue to fail, but the exponential backoff will kick in, and eventually the we will get a new (hopefully more valid) ticket from the mon and be able to continue. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:12 -07:00
Sage Weil	20e55c4cc7	libceph: clear messenger auth_retry flag when we authenticate We maintain a counter of failed auth attempts to allow us to retry once before failing. However, if the second attempt succeeds, the flag isn't cleared, which makes us think auth failed again later when the connection resets for other reasons (like a socket error). This is one part of the sorry sequence of events in bug http://tracker.ceph.com/issues/4282 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:11 -07:00
Sage Weil	3a23083bda	libceph: implement RECONNECT_SEQ feature This is an old protocol extension that allows the client and server to avoid resending old messages after a reconnect (following a socket error). Instead, the exchange their sequence numbers during the handshake. This avoids sending a bunch of useless data over the socket. It has been supported in the server code since v0.22 (Sep 2010). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:17:09 -07:00
Alex Elder	8a166d0536	libceph: more cleanup of write_partial_msg_pages() Basically all cases in write_partial_msg_pages() use the cursor, and as a result we can simplify that function quite a bit. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:06 -07:00
Alex Elder	9d2a06c275	libceph: kill message trail The wart that is the ceph message trail can now be removed, because its only user was the osd client, and the previous patch made that no longer the case. The result allows write_partial_msg_pages() to be simplified considerably. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:05 -07:00
Alex Elder	95e072eb38	libceph: kill osd request r_trail The osd trail is a pagelist, used only for a CALL osd operation to hold the class and method names, along with any input data for the call. It is only currently used by the rbd client, and when it's used it is the only bit of outbound data in the osd request. Since we already support (non-trail) pagelist data in a message, we can just save this outbound CALL data in the "normal" pagelist rather than the trail, and get rid of the trail entirely. The existing pagelist support depends on the pagelist being dynamically allocated, and ownership of it is passed to the messenger once it's been attached to a message. (That is to say, the messenger releases and frees the pagelist when it's done with it). That means we need to dynamically allocate the pagelist also. Note that we simply assert that the allocation of a pagelist structure succeeds. Appending to a pagelist might require a dynamic allocation, so we're already assuming we won't run into trouble doing so (we're just ignore any failures--and that should be fixed at some point). This resolves: http://tracker.ceph.com/issues/4407 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:04 -07:00
Alex Elder	9a5e6d09dd	libceph: have osd requests support pagelist data Add support for recording a ceph pagelist as data associated with an osd request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:03 -07:00
Alex Elder	175face2ba	libceph: let osd ops determine request data length The length of outgoing data in an osd request is dependent on the osd ops that are embedded in that request. Each op is encoded into a request message using osd_req_encode_op(), so that should be used to determine the amount of outgoing data implied by the op as it is encoded. Have osd_req_encode_op() return the number of bytes of outgoing data implied by the op being encoded, and accumulate and use that in ceph_osdc_build_request(). As a result, ceph_osdc_build_request() no longer requires its "len" parameter, so get rid of it. Using the sum of the op lengths rather than the length provided is a valid change because: - The only callers of osd ceph_osdc_build_request() are rbd and the osd client (in ceph_osdc_new_request() on behalf of the file system). - When rbd calls it, the length provided is only non-zero for write requests, and in that case the single op has the same length value as what was passed here. - When called from ceph_osdc_new_request(), (it's not all that easy to see, but) the length passed is also always the same as the extent length encoded in its (single) write op if present. This resolves: http://tracker.ceph.com/issues/4406 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:02 -07:00
Alex Elder	e766d7b55e	libceph: implement pages array cursor Implement and use cursor routines for page array message data items for outbound message data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:01 -07:00
Alex Elder	6aaa4511de	libceph: implement bio message data item cursor Implement and use cursor routines for bio message data items for outbound message data. (See the previous commit for reasoning in support of the changes in out_msg_pos_next().) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:59 -07:00
Alex Elder	7fe1e5e57b	libceph: use data cursor for message pagelist Switch to using the message cursor for the (non-trail) outgoing pagelist data item in a message if present. Notes on the logic changes in out_msg_pos_next(): - only the mds client uses a ceph pagelist for message data; - if the mds client ever uses a pagelist, it never uses a page array (or anything else, for that matter) for data in the same message; - only the osd client uses the trail portion of a message data, and when it does, it never uses any other data fields for outgoing data in the same message; and finally - only the rbd client uses bio message data (never pagelist). Therefore out_msg_pos_next() can assume: - if we're in the trail portion of a message, the message data pagelist, data, and bio can be ignored; and - if there is a page list, there will never be any a bio or page array data, and vice-versa. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:58 -07:00
Alex Elder	dd236fcb65	libceph: prepare for other message data item types This just inserts some infrastructure in preparation for handling other types of ceph message data items. No functional changes, just trying to simplify review by separating out some noise. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:57 -07:00
Alex Elder	fe38a2b67b	libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:56 -07:00
Alex Elder	437945094f	libceph: abstract message data Group the types of message data into an abstract structure with a type indicator and a union containing fields appropriate to the type of data it represents. Use this to represent the pages, pagelist, bio, and trail in a ceph message. Verify message data is of type NONE in ceph_msg_data_set_*() routines. Since information about message data of type NONE really should not be interpreted, get rid of the other assertions in those functions. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:55 -07:00
Alex Elder	f9e15777af	libceph: be explicit about message data representation A ceph message has a data payload portion. The memory for that data (either the source of data to send or the location to place data that is received) is specified in several ways. The ceph_msg structure includes fields for all of those ways, but this mispresents the fact that not all of them are used at a time. Specifically, the data in a message can be in: - an array of pages - a list of pages - a list of Linux bios - a second list of pages (the "trail") (The two page lists are currently only ever used for outgoing data.) Impose more structure on the ceph message, making the grouping of some of these fields explicit. Shorten the name of the "page_alignment" field. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:54 -07:00
Alex Elder	97fb1c7f66	libceph: define ceph_msg_has_() data macros Define and use macros ceph_msg_has_() to determine whether to operate on the pages, pagelist, bio, and trail fields of a message. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:53 -07:00
Alex Elder	35b6280899	libceph: define and use ceph_crc32c_page() Factor out a common block of code that updates a CRC calculation over a range of data in a page. This and the preceding patches are related to: http://tracker.ceph.com/issues/4403 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:52 -07:00
Alex Elder	afb3d90e20	libceph: define and use ceph_tcp_recvpage() Define a new function ceph_tcp_recvpage() that behaves in a way comparable to ceph_tcp_sendpage(). Rearrange the code in both read_partial_message_pages() and read_partial_message_bio() so they have matching structure, (similar to what's in write_partial_msg_pages()), and use this new function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:51 -07:00
Alex Elder	34d2d2006c	libceph: encapsulate reading message data Pull the code that reads the data portion into a message into a separate function read_partial_msg_data(). Rename write_partial_msg_pages() to be write_partial_message_data() to match its read counterpart, and to reflect its more generic purpose. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:50 -07:00
Alex Elder	e387d525b0	libceph: small write_partial_msg_pages() refactor Define local variables page_offset and length to represent the range of bytes within a page that will be sent by ceph_tcp_sendpage() in write_partial_msg_pages(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:48 -07:00
Alex Elder	78625051b5	libceph: consolidate message prep code In prepare_write_message_data(), various fields are initialized in preparation for writing message data out. Meanwhile, in read_partial_message(), there is essentially the same block of code, operating on message variables associated with an incoming message. Generalize prepare_write_message_data() so it works for both incoming and outcoming messages, and use it in both spots. The did_page_crc is not used for input (so it's harmless to initialize it). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:47 -07:00
Alex Elder	bae6acd9c6	libceph: use local variables for message positions There are several places where a message's out_msg_pos or in_msg_pos field is used repeatedly within a function. Use a local pointer variable for this purpose to unclutter the code. This and the upcoming cleanup patches are related to: http://tracker.ceph.com/issues/4403 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:46 -07:00
Alex Elder	98a0370898	libceph: don't clear bio_iter in prepare_write_message() At one time it was necessary to clear a message's bio_iter field to avoid a bad pointer dereference in write_partial_msg_pages(). That no longer seems to be the case. Here's why. The message's bio fields represent (in this case) outgoing data. Between where the bio_iter is made NULL in prepare_write_message() and the call in that function to prepare_message_data(), the bio fields are never used. In prepare_message_data(), init-bio_iter() is called, and the result of that overwrites the value in the message's bio_iter field. Because it gets overwritten anyway, there is no need to set it to NULL. So don't do it. This resolves: http://tracker.ceph.com/issues/4402 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:45 -07:00
Alex Elder	07aa155878	libceph: activate message data assignment checks The mds client no longer tries to assign zero-length message data, and the osd client no longer sets its data info more than once. This allows us to activate assertions in the messenger to verify these things never happen. This resolves both of these: http://tracker.ceph.com/issues/4263 http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 21:16:44 -07:00
Alex Elder	70636773b7	libceph: set response data fields earlier When an incoming message is destined for the osd client, the messenger calls the osd client's alloc_msg method. That function looks up which request has the tid matching the incoming message, and returns the request message that was preallocated to receive the response. The response message is therefore known before the request is even started. Between the start of the request and the receipt of the response, the request and its data fields will not change, so there's no reason we need to hold off setting them. In fact it's preferable to set them just once because it's more obvious that they're unchanging. So set up the fields describing where incoming data is to land in a response message at the beginning of ceph_osdc_start_request(). Define a helper function that sets these fields, and use it to set the fields for both outgoing data in the request message and incoming data in the response. This resolves: http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:43 -07:00
Alex Elder	4a73ef27ad	libceph: record message data byte length Record the number of bytes of data in a page array rather than the number of pages in the array. It can be assumed that the page array is of sufficient size to hold the number of bytes indicated (and offset by the indicated alignment). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:42 -07:00
Alex Elder	ebf18f4709	ceph: only set message data pointers if non-empty Change it so we only assign outgoing data information for messages if there is outgoing data to send. This then allows us to add a few more (currently commented-out) assertions. This is related to: http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 21:16:41 -07:00
Alex Elder	27fa83852b	libceph: isolate other message data fields Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and ceph_msg_data_set_trail() to clearly abstract the assignment of the remaining data-related fields in a ceph message structure. Use the new functions in the osd client and mds client. This partially resolves: http://tracker.ceph.com/issues/4263 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:40 -07:00
Alex Elder	f1baeb2b9f	libceph: set page info with byte length When setting page array information for message data, provide the byte length rather than the page count ceph_msg_data_set_pages(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:39 -07:00
Alex Elder	02afca6ca0	libceph: isolate message page field manipulation Define a function ceph_msg_data_set_pages(), which more clearly abstracts the assignment page-related fields for data in a ceph message structure. Use this new function in the osd client and mds client. Ideally, these fields would never be set more than once (with BUG_ON() calls to guarantee that). At the moment though the osd client sets these every time it receives a message, and in the event of a communication problem this can happen more than once. (This will be resolved shortly, but setting up these helpers first makes it all a bit easier to work with.) Rearrange the field order in a ceph_msg structure to group those that are used to define the possible data payloads. This partially resolves: http://tracker.ceph.com/issues/4263 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:38 -07:00
Alex Elder	e0c594878e	libceph: record byte count not page count Record the byte count for an osd request rather than the page count. The number of pages can always be derived from the byte count (and alignment/offset) but the reverse is not true. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:36 -07:00
Alex Elder	9516e45b25	libceph: simplify new message initialization Rather than explicitly initializing many fields to 0, NULL, or false in a newly-allocated message, just use kzalloc() for allocating new messages. This will become a much more convenient way of doing things anyway for upcoming patches that abstract the data field. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:35 -07:00
Alex Elder	35c7bfbcd4	libceph: advance pagelist with list_rotate_left() While processing an outgoing pagelist (either the data pagelist or trail) in a ceph message, the messenger cycles through each of the pages on the list. This is accomplished in out_msg_pos_next(), if the end of the first page on the list is reached, the first page is moved to the end of the list. There is a list operation, list_rotate_left(), which performs exactly this operation, and by using it, what's really going on becomes more obvious. So replace these two list_move_tail() calls with list_rotate_left(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:34 -07:00
Alex Elder	e788182fa6	libceph: define and use in_msg_pos_next() Define a new function in_msg_pos_next() to match out_msg_pos_next(), and use it in place of code at the end of read_partial_message_pages() and read_partial_message_bio(). Note that the page number is incremented and offset reset under slightly different conditions from before. The result is equivalent, however, as explained below. Each time an incoming message is going to arrive, we find out how much room is left--not surpassing the current page--and provide that as the number of bytes to receive. So the amount we'll use is the lesser of: all that's left of the entire request; and all that's left in the current page. If we received exactly how many were requested, we either reached the end of the request or the end of the page. In the first case, we're done, in the second, we move onto the next page in the array. In all cases but (possibly) on the last page, after adding the number of bytes received, page_pos == PAGE_SIZE. On the last page, it doesn't really matter whether we increment the page number and reset the page position, because we're done and we won't come back here again. The code previously skipped over that last case, basically. The new code handles that case the same as the others, incrementing and resetting. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:33 -07:00
Alex Elder	b3d56fab33	libceph: kill args in read_partial_message_bio() There is only one caller for read_partial_message_bio(), and it always passes &msg->bio_iter and &bio_seg as the second and third arguments. Furthermore, the message in question is always the connection's in_msg, and we can get that inside the called function. So drop those two parameters and use their derived equivalents. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:32 -07:00
Alex Elder	e1dcb128f8	libceph: change type of ceph_tcp_sendpage() "more" Change the type of the "more" parameter from int to bool. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:31 -07:00
Alex Elder	6ebc8b32b3	libceph: minor byte order problems in read_partial_message() Some values printed are not (necessarily) in CPU order. We already have a copy of the converted versions, so use them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:30 -07:00
Alex Elder	7b11ba3758	libceph: define CEPH_MSG_MAX_MIDDLE_LEN This is probably unnecessary but the code read as if it were wrong in read_partial_message(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:29 -07:00
Alex Elder	4137577ae3	libceph: clean up skipped message logic In ceph_con_in_msg_alloc() it is possible for a connection's alloc_msg method to indicate an incoming message should be skipped. By default, read_partial_message() initializes the skip variable to 0 before it gets provided to ceph_con_in_msg_alloc(). The osd client, mon client, and mds client each supply an alloc_msg method. The mds client always assigns skip to be 0. The other two leave the skip value of as-is, or assigns it to zero, except: - if no (osd or mon) request having the given tid is found, in which case skip is set to 1 and NULL is returned; or - in the osd client, if the data of the reply message is not adequate to hold the message to be read, it assigns skip value 1 and returns NULL. So the returned message pointer will always be NULL if skip is ever non-zero. Clean up the logic a bit in ceph_con_in_msg_alloc() to make this state of affairs more obvious. Add a comment explaining how a null message pointer can mean either a message that should be skipped or a problem allocating a message. This resolves: http://tracker.ceph.com/issues/4324 Reported-by: Greg Farnum <greg@inktank.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 21:16:28 -07:00
Alex Elder	0fff87ec79	libceph: separate read and write data An osd request defines information about where data to be read should be placed as well as where data to write comes from. Currently these are represented by common fields. Keep information about data for writing separate from data to be read by splitting these into data_in and data_out fields. This is the key patch in this whole series, in that it actually identifies which osd requests generate outgoing data and which generate incoming data. It's less obvious (currently) that an osd CALL op generates both outgoing and incoming data; that's the focus of some upcoming work. This resolves: http://tracker.ceph.com/issues/4127 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:27 -07:00
Alex Elder	2ac2b7a6d4	libceph: distinguish page and bio requests An osd request uses either pages or a bio list for its data. Use a union to record information about the two, and add a data type tag to select between them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:25 -07:00
Alex Elder	2794a82a11	libceph: separate osd request data info Pull the fields in an osd request structure that define the data for the request out into a separate structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:24 -07:00
Alex Elder	153e5167e0	libceph: don't assign page info in ceph_osdc_new_request() Currently ceph_osdc_new_request() assigns an osd request's r_num_pages and r_alignment fields. The only thing it does after that is call ceph_osdc_build_request(), and that doesn't need those fields to be assigned. Move the assignment of those fields out of ceph_osdc_new_request() and into its caller. As a result, the page_align parameter is no longer used, so get rid of it. Note that in ceph_sync_write(), the value for req->r_num_pages had already been calculated earlier (as num_pages, and fortunately it was computed the same way). So don't bother recomputing it, but because it's not needed earlier, move that calculation after the call to ceph_osdc_new_request(). Hold off making the assignment to r_alignment, doing it instead r_pages and r_num_pages are getting set. Similarly, in start_read(), nr_pages already holds the number of pages in the array (and is calculated the same way), so there's no need to recompute it. Move the assignment of the page alignment down with the others there as well. This and the next few patches are preparation work for: http://tracker.ceph.com/issues/4127 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:23 -07:00
Alex Elder	53ded495c6	libceph: define mds_alloc_msg() method The only user of the ceph messenger that doesn't define an alloc_msg method is the mds client. Define one, such that it works just like it did before, and simplify ceph_con_in_msg_alloc() by assuming the alloc_msg method is always present. This and the next patch resolve: http://tracker.ceph.com/issues/4322 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 21:16:19 -07:00
Alex Elder	1d866d1c31	libceph: drop mutex while allocating a message In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a connection a new message is allocated with ceph_msg_new(). Drop the mutex before making this call, and make sure we're still connected when we get it back again. This is preparing for the next patch, which ensures all connections define an alloc_msg method, and then handles them all the same way. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 21:16:18 -07:00
Alex Elder	41766f87f5	libceph: rename ceph_calc_object_layout() The purpose of ceph_calc_object_layout() is to fill in the pool number and seed for a ceph_pg structure provided, based on a given osd map and target object id. Currently that function takes a file layout parameter, but the only thing used out of that is its pool number. Change the function so it takes a pool number rather than the full file layout structure. Only update the ceph_pg if the pool is found in the osd map. Get rid of few useless lines of code from the function while there. Since the function now very clearly just fills in the ceph_pg structure it's provided, rename it ceph_calc_ceph_pg(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:17 -07:00
Alex Elder	ec02a2f2ff	libceph: kill ceph_msg->pagelist_count The pagelist_count field is never actually used, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:16 -07:00
Alex Elder	8f63ca2d23	libceph: fix wrong opcode use in osd_req_encode_op() The new cases added to osd_req_encode_op() caused a new sparse error, which highlighted an existing problem that had been overlooked since it was originally checked in. When an unsupported opcode is found the destination rather than the source opcode was being used in the error message. The two differ in their byte order, and we want to be using the one in the source. Fix the problem in both spots. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:13 -07:00
Alex Elder	0d5af16435	libceph: complete lingering requests only once An osd request marked to linger will be re-submitted in the event a connection to the target osd gets dropped. Currently, if there is a callback function associated with a request it will be called each time a request is submitted--which for lingering requests can be more than once. Change it so a request--including lingering ones--will get completed (from the perspective of the user of the osd client) exactly once. This resolves: http://tracker.ceph.com/issues/3967 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:12 -07:00
Alex Elder	f51a822c31	libceph: set page alignment in start_request() The page alignment field for a request is currently set in ceph_osdc_build_request(). It's not needed at that point nor do either of its callers need that value assigned at any point before they call ceph_osdc_start_request(). So move that assignment into ceph_osdc_start_request(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:29 -07:00
Alex Elder	d4b515fa10	libceph: distinguish page array and pagelist count Use distinct fields for tracking the number of pages in a message's page array and in a message's page list. Currently only one or the other is used at a time, but that will be changing soon. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:28 -07:00
Alex Elder	60cf5992d9	libceph: don't pass request to calc_layout() The only remaining reason to pass the osd request to calc_layout() is to fill in its r_num_pages and r_page_alignment fields. Once it fills those in, it doesn't do anything more with them. We can therefore move those assignments into the caller, and get rid of the "req" parameter entirely. Note, however, that the only caller is ceph_osdc_new_request(), and that immediately overwrites those fields with values based on its passed-in page offset. So the assignment inside calc_layout() was redundant anyway. This resolves: http://tracker.ceph.com/issues/4262 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:27 -07:00
Alex Elder	dbe0fc4188	libceph: format target object name in caller Move the formatting of the object name (oid) to use for an object request into the caller of calc_layout(). This makes the "vino" parameter no longer necessary, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:26 -07:00
Alex Elder	47a05811b6	libceph: pass object number back to calc_layout() caller Have calc_layout() pass the computed object number back to its caller. (This is a small step to simplify review.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:25 -07:00
Alex Elder	07c09b7255	libceph: make ceph_msg->bio_seg be unsigned The bio_seg field is used by the ceph messenger in iterating through a bio. It should never have a negative value, so make it an unsigned. (I contemplated making it unsigned short to match the struct bio definition, but it offered no benefit.) Change variables used to hold bio_seg values to all be unsigned as well. Change two variable names in init_bio_iter() to match the convention used everywhere else. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:23 -07:00
Alex Elder	3ff5f385b1	libceph: fix a osd request memory leak If an invalid layout is provided to ceph_osdc_new_request(), its call to calc_layout() might return an error. At that point in the function we've already allocated an osd request structure, so we need to free it (drop a reference) in the event such an error occurs. The only other value calc_layout() will return is 0, so make that explicit in the successful case. This resolves: http://tracker.ceph.com/issues/4240 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:14:22 -07:00
Linus Torvalds	20b4fb4852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...	2013-05-01 17:51:54 -07:00
David Howells	a8ca16ea7b	proc: Supply a function to remove a proc entry by PDE Supply a function (proc_remove()) to remove a proc entry (and any subtree rooted there) by proc_dir_entry pointer rather than by name and (optionally) root dir entry pointer. This allows us to eliminate all remaining pde->name accesses outside of procfs. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Grant Likely <grant.likely@linaro.or> cc: linux-acpi@vger.kernel.org cc: openipmi-developer@lists.sourceforge.net cc: devicetree-discuss@lists.ozlabs.org cc: linux-pci@vger.kernel.org cc: netdev@vger.kernel.org cc: netfilter-devel@vger.kernel.org cc: alsa-devel@alsa-project.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:46 -04:00
David Howells	0bb80f2405	proc: Split the namespace stuff out into linux/proc_ns.h Split the proc namespace stuff out into linux/proc_ns.h. Signed-off-by: David Howells <dhowells@redhat.com> cc: netdev@vger.kernel.org cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:39 -04:00
David Howells	271a15eabe	proc: Supply PDE attribute setting accessor functions Supply accessor functions to set attributes in proc_dir_entry structs. The following are supplied: proc_set_size() and proc_set_user(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> cc: linuxppc-dev@lists.ozlabs.org cc: linux-media@vger.kernel.org cc: netdev@vger.kernel.org cc: linux-wireless@vger.kernel.org cc: linux-pci@vger.kernel.org cc: netfilter-devel@vger.kernel.org cc: alsa-devel@alsa-project.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-01 17:29:18 -04:00
Linus Torvalds	73287a43cc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Highlights (1721 non-merge commits, this has to be a record of some sort): 1) Add 'random' mode to team driver, from Jiri Pirko and Eric Dumazet. 2) Make it so that any driver that supports configuration of multiple MAC addresses can provide the forwarding database add and del calls by providing a default implementation and hooking that up if the driver doesn't have an explicit set of handlers. From Vlad Yasevich. 3) Support GSO segmentation over tunnels and other encapsulating devices such as VXLAN, from Pravin B Shelar. 4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton. 5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita Dukkipati. 6) In the PHY layer, allow supporting wake-on-lan in situations where the PHY registers have to be written for it to be configured. Use it to support wake-on-lan in mv643xx_eth. From Michael Stapelberg. 7) Significantly improve firewire IPV6 support, from YOSHIFUJI Hideaki. 8) Allow multiple packets to be sent in a single transmission using network coding in batman-adv, from Martin Hundebøll. 9) Add support for T5 cxgb4 chips, from Santosh Rastapur. 10) Generalize the VXLAN forwarding tables so that there is more flexibility in configurating various aspects of the endpoints. From David Stevens. 11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver, from Dmitry Kravkov. 12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo Neira Ayuso. 13) Start adding networking selftests. 14) In situations of overload on the same AF_PACKET fanout socket, or per-cpu packet receive queue, minimize drop by distributing the load to other cpus/fanouts. From Willem de Bruijn and Eric Dumazet. 15) Add support for new payload offset BPF instruction, from Daniel Borkmann. 16) Convert several drivers over to mdoule_platform_driver(), from Sachin Kamat. 17) Provide a minimal BPF JIT image disassembler userspace tool, from Daniel Borkmann. 18) Rewrite F-RTO implementation in TCP to match the final specification of it in RFC4138 and RFC5682. From Yuchung Cheng. 19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear you like netlink, so I implemented netlink dumping of netlink sockets.") From Andrey Vagin. 20) Remove ugly passing of rtnetlink attributes into rtnl_doit functions, from Thomas Graf. 21) Allow userspace to be able to see if a configuration change occurs in the middle of an address or device list dump, from Nicolas Dichtel. 22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes Frederic Sowa. 23) Increase accuracy of packet length used by packet scheduler, from Jason Wang. 24) Beginning set of changes to make ipv4/ipv6 fragment handling more scalable and less susceptible to overload and locking contention, from Jesper Dangaard Brouer. 25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_() instead. From Hong Zhiguo. 26) Optimize route usage in IPVS by avoiding reference counting where possible, from Julian Anastasov. 27) Convert IPVS schedulers to RCU, also from Julian Anastasov. 28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger Eitzenberger. 29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG, nfnetlink_log, and nfnetlink_queue. From Gao feng. 30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa. 31) Support several new r8169 chips, from Hayes Wang. 32) Support tokenized interface identifiers in ipv6, from Daniel Borkmann. 33) Use usbnet_link_change() helper in USB net driver, from Ming Lei. 34) Add 802.1ad vlan offload support, from Patrick McHardy. 35) Support mmap() based netlink communication, also from Patrick McHardy. 36) Support HW timestamping in mlx4 driver, from Amir Vadai. 37) Rationalize AF_PACKET packet timestamping when transmitting, from Willem de Bruijn and Daniel Borkmann. 38) Bring parity to what's provided by /proc/net/packet socket dumping and the info provided by netlink socket dumping of AF_PACKET sockets. From Nicolas Dichtel. 39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin Poirier" git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) filter: fix va_list build error af_unix: fix a fatal race with bit fields bnx2x: Prevent memory leak when cnic is absent bnx2x: correct reading of speed capabilities net: sctp: attribute printl with __printf for gcc fmt checks netlink: kconfig: move mmap i/o into netlink kconfig netpoll: convert mutex into a semaphore netlink: Fix skb ref counting. net_sched: act_ipt forward compat with xtables mlx4_en: fix a build error on 32bit arches Revert "bnx2x: allow nvram test to run when device is down" bridge: avoid OOPS if root port not found drivers: net: cpsw: fix kernel warn on cpsw irq enable sh_eth: use random MAC address if no valid one supplied 3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA) tg3: fix to append hardware time stamping flags unix/stream: fix peeking with an offset larger than data in queue unix/dgram: fix peeking with an offset larger than data in queue unix/dgram: peek beyond 0-sized skbs openvswitch: Remove unneeded ovs_netdev_get_ifindex() ...	2013-05-01 14:08:52 -07:00
Eric Dumazet	60bc851ae5	af_unix: fix a fatal race with bit fields Using bit fields is dangerous on ppc64/sparc64, as the compiler [1] uses 64bit instructions to manipulate them. If the 64bit word includes any atomic_t or spinlock_t, we can lose critical concurrent changes. This is happening in af_unix, where unix_sk(sk)->gc_candidate/ gc_maybe_cycle/lock share the same 64bit word. This leads to fatal deadlock, as one/several cpus spin forever on a spinlock that will never be available again. A safer way would be to use a long to store flags. This way we are sure compiler/arch wont do bad things. As we own unix_gc_lock spinlock when clearing or setting bits, we can use the non atomic __set_bit()/__clear_bit(). recursion_level can share the same 64bit location with the spinlock, as it is set only with this spinlock held. [1] bug fixed in gcc-4.8.0 : http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080 Reported-by: Ambrose Feinstein <ambrose@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 15:13:49 -04:00
Daniel Borkmann	be3e45810b	net: sctp: attribute printl with __printf for gcc fmt checks Let GCC check for format string errors in sctp's probe printl function. This patch fixes the warning when compiled with W=1: net/sctp/probe.c:73:2: warning: function might be possible candidate for 'gnu_printf' format attribute [-Wmissing-format-attribute] Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 15:04:10 -04:00
Daniel Borkmann	ee1bec9b3b	netlink: kconfig: move mmap i/o into netlink kconfig Currently, in menuconfig, Netlink's new mmaped IO is the very first entry under the ``Networking support'' item and comes even before ``Networking options'': [ ] Netlink: mmaped IO Networking options ---> ... Lets move this into ``Networking options'' under netlink's Kconfig, since this might be more appropriate. Introduced by commit `ccdfcc398` (``netlink: mmaped netlink: ring setup''). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 15:02:42 -04:00
Neil Horman	bd7c4b604a	netpoll: convert mutex into a semaphore Bart Van Assche recently reported a warning to me: <IRQ> [<ffffffff8103d79f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8103d7fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff814761dd>] mutex_trylock+0x16d/0x180 [<ffffffff813968c9>] netpoll_poll_dev+0x49/0xc30 [<ffffffff8136a2d2>] ? __alloc_skb+0x82/0x2a0 [<ffffffff81397715>] netpoll_send_skb_on_dev+0x265/0x410 [<ffffffff81397c5a>] netpoll_send_udp+0x28a/0x3a0 [<ffffffffa0541843>] ? write_msg+0x53/0x110 [netconsole] [<ffffffffa05418bf>] write_msg+0xcf/0x110 [netconsole] [<ffffffff8103eba1>] call_console_drivers.constprop.17+0xa1/0x1c0 [<ffffffff8103fb76>] console_unlock+0x2d6/0x450 [<ffffffff8104011e>] vprintk_emit+0x1ee/0x510 [<ffffffff8146f9f6>] printk+0x4d/0x4f [<ffffffffa0004f1d>] scsi_print_command+0x7d/0xe0 [scsi_mod] This resulted from my commit `ca99ca14c` which introduced a mutex_trylock operation in a path that could execute in interrupt context. When mutex debugging is enabled, the above warns the user when we are in fact exectuting in interrupt context interrupt context. After some discussion, It seems that a semaphore is the proper mechanism to use here. While mutexes are defined to be unusable in interrupt context, no such condition exists for semaphores (save for the fact that the non blocking api calls, like up and down_trylock must be used when in irq context). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Bart Van Assche <bvanassche@acm.org> CC: Bart Van Assche <bvanassche@acm.org> CC: David Miller <davem@davemloft.net> CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 15:00:24 -04:00
Pravin B Shelar	ae6164adeb	netlink: Fix skb ref counting. Commit `f9c2288837` (netlink: implement memory mapped recvmsg) increamented skb->users ref count twice for a dump op which does not look right. Following patch fixes that. CC: Patrick McHardy <kaber@trash.net> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 14:57:03 -04:00
Jamal Hadi Salim	0dcffd0964	net_sched: act_ipt forward compat with xtables Deal with changes in newer xtables while maintaining backward compatibility. Thanks to Jan Engelhardt for suggestions. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-01 13:19:19 -04:00
Wei Yongjun	1eb6d6223a	svcauth_gss: fix error return code in rsc_parse() Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-30 18:14:15 -04:00
stephen hemminger	91bc033c4d	bridge: avoid OOPS if root port not found Bridge can crash while trying to send topology change packet. This happens if root port can't be found. This was reported by user but currently unable to reproduce it easily. The STP conditions that cause this are not known yet, but the problem doesn't have to be fatal. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 15:51:08 -04:00
Linus Torvalds	8728f986fe	NFS client bugfixes and cleanups for 3.10 - NLM: stable fix for NFSv2/v3 blocking locks - NFSv4.x: stable fixes for the delegation recall error handling code - NFSv4.x: Security flavour negotiation fixes and cleanups by Chuck Lever - SUNRPC: A number of RPCSEC_GSS fixes and cleanups also from Chuck - NFSv4.x assorted state management and reboot recovery bugfixes - NFSv4.1: In cases where we have already looked up a file, and hold a valid filehandle, use the new open-by-filehandle operation instead of opening by name. - Allow the NFSv4.1 callback thread to freeze - NFSv4.x: ensure that file unlock waits for readahead to complete - NFSv4.1: ensure that the RPC layer doesn't override the NFS session table size negotiation by limiting the number of slots. - NFSv4.x: Fix SETATTR spec compatibility issues -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQIcBAABAgAGBQJRfu+cAAoJEGcL54qWCgDylxkP/24cXOLHMKMYnIab0cQYIW2m SQgADGE+MqgTVlWjGVWublVMDY1R51iINsksAjxMtXYt50FdBJEqV2uxIGi4VnbR nR9eppqQ6vk5e6r5+aZyVmWKoLFnJ4MBF6OpPUZB5mf8iH/fiixmSYLvseopPdDj bjHwCxg+xEgew5EhQF/xqkEfkAp2NN84xUksTWb9uDIW2c3SJweY/ZVR2Zsqpugm oqYVtrSLvNKqINQG8OP10s+mRWULwoqapF+kEHlxNbRy26C0zlbXPaneSgYzqHsY OyRkAT7uJJqStYlqdW7k+DhyNMB+T33WAGJpWQlfJGYk5d/n0rtBJDVo0hfhCSQr VkOXiO9J08NMbelCu4+0CJii7h5GCaqpuJEEmNL6AlF/TJVkIQJuRaG2+WDmEtO2 oYd4UfXlAbUuts1SW7u/yyN/yrjVTm1tZYRBqn2VJdqh1s8dMxEWPct2Yn314mpS ODAnbDkEhtWlc9cloSRnwKec5WcxMZb19IJeK9ZvHm7PfIu/QHtj6Ren8s1//bZI OMQxC/Vf/wcjMdNtr7QdMNxWG1aK8DL9mYP5XwCrkZ560LIrtxmhyqYeoGAfgO5u +K/gKmQwjsaPhEa8jbP2/wI0II9yKPWj/fVwqhbhqaBUx5GA2iAKcdpPP6JAMAti +PXkLTtkyrIgSNwzl63S =Hgot -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes and cleanups from Trond Myklebust: - NLM: stable fix for NFSv2/v3 blocking locks - NFSv4.x: stable fixes for the delegation recall error handling code - NFSv4.x: Security flavour negotiation fixes and cleanups by Chuck Lever - SUNRPC: A number of RPCSEC_GSS fixes and cleanups also from Chuck - NFSv4.x assorted state management and reboot recovery bugfixes - NFSv4.1: In cases where we have already looked up a file, and hold a valid filehandle, use the new open-by-filehandle operation instead of opening by name. - Allow the NFSv4.1 callback thread to freeze - NFSv4.x: ensure that file unlock waits for readahead to complete - NFSv4.1: ensure that the RPC layer doesn't override the NFS session table size negotiation by limiting the number of slots. - NFSv4.x: Fix SETATTR spec compatibility issues * tag 'nfs-for-3.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits) NFSv4: Warn once about servers that incorrectly apply open mode to setattr NFSv4: Servers should only check SETATTR stateid open mode on size change NFSv4: Don't recheck permissions on open in case of recovery cached open NFSv4.1: Don't do a delegated open for NFS4_OPEN_CLAIM_DELEG_CUR_FH modes NFSv4.1: Use the more efficient open_noattr call for open-by-filehandle NFS: Retry SETCLIENTID with AUTH_SYS instead of AUTH_NONE NFSv4: Ensure that we clear the NFS_OPEN_STATE flag when appropriate LOCKD: Ensure that nlmclnt_block resets block->b_status after a server reboot NFSv4: Ensure the LOCK call cannot use the delegation stateid NFSv4: Use the open stateid if the delegation has the wrong mode nfs: Send atime and mtime as a 64bit value NFSv4: Record the OPEN create mode used in the nfs4_opendata structure NFSv4.1: Set the RPC_CLNT_CREATE_INFINITE_SLOTS flag for NFSv4.1 transports SUNRPC: Allow rpc_create() to request that TCP slots be unlimited SUNRPC: Fix a livelock problem in the xprt->backlog queue NFSv4: Fix handling of revoked delegations by setattr NFSv4 release the sequence id in the return on close case nfs: remove unnecessary check for NULL inode->i_flock from nfs_delegation_claim_locks NFS: Ensure that NFS file unlock waits for readahead to complete NFS: Add functionality to allow waiting on all outstanding reads to complete ...	2013-04-30 11:28:08 -07:00
Linus Torvalds	5d434fcb25	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits) mm: Convert print_symbol to %pSR gfs2: Convert print_symbol to %pSR m32r: Convert print_symbol to %pSR iostats.txt: add easy-to-find description for field 6 x86 cmpxchg.h: fix wrong comment treewide: Fix typo in printk and comments doc: devicetree: Fix various typos docbook: fix 8250 naming in device-drivers pata_pdc2027x: Fix compiler warning treewide: Fix typo in printks mei: Fix comments in drivers/misc/mei treewide: Fix typos in kernel messages pm44xx: Fix comment for "CONFIG_CPU_IDLE" doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP" mmzone: correct "pags" to "pages" in comment. kernel-parameters: remove outdated 'noresidual' parameter Remove spurious _H suffixes from ifdef comments sound: Remove stray pluses from Kconfig file radio-shark: Fix printk "CONFIG_LED_CLASS" doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE ...	2013-04-30 09:36:50 -07:00
David S. Miller	58717686cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 03:55:20 -04:00
Benjamin Poirier	79f632c71b	unix/stream: fix peeking with an offset larger than data in queue Currently, peeking on a unix stream socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:43:54 -04:00
Benjamin Poirier	39cc86130b	unix/dgram: fix peeking with an offset larger than data in queue Currently, peeking on a unix datagram socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. That's because *off is not reset between each skb_queue_walk(). This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:43:54 -04:00
Benjamin Poirier	add05ad4e9	unix/dgram: peek beyond 0-sized skbs "77c1090 net: fix infinite loop in __skb_recv_datagram()" (v3.8) introduced a regression: After that commit, recv can no longer peek beyond a 0-sized skb in the queue. __skb_recv_datagram() instead stops at the first skb with len == 0 and results in the system call failing with -EFAULT via skb_copy_datagram_iovec(). When peeking at an offset with 0-sized skb(s), each one of those is received only once, in sequence. The offset starts moving forward again after receiving datagrams with len > 0. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:43:54 -04:00
Thomas Graf	cff63a5292	openvswitch: Remove unneeded ovs_netdev_get_ifindex() The only user is get_dpifindex(), no need to redirect via the port operations. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:19:11 -04:00
Sridhar Samudrala	0c772159d1	net: Use consume_skb() to free gso segmented skb Use consume_skb() to free the original skb that is successfully transmitted as gso segmented skbs so that it is not treated as a drop due to an error. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 00:18:13 -04:00
Akinobu Mita	70e3ba72ba	net/core: remove duplicate statements by do-while loop Remove duplicate statements by using do-while loop instead of while loop. - A; - while (e) { + do { A; - } + } while (e); Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Akinobu Mita	33d7c5e539	net/core: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Akinobu Mita	ca3d41a588	net/netfilter: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Akinobu Mita	5106f3bd81	net/sched: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Akinobu Mita	c86d2ddec7	net/sunrpc: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:43 -07:00
Jeff Layton	713e00a324	sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic() Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:41 -07:00
Andy Shevchenko	2e0fb404c8	lib, net: make isodigit() public and use it There are at least two users of isodigit(). Let's make it a public function of ctype.h. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:19 -07:00
J. Bruce Fields	d28fcc830c	svcrpc: fix gss-proxy to respect user namespaces Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-29 18:21:29 -04:00
Fengguang Wu	6278b62aa8	SUNRPC: gssp_procedures[] can be static Cc: Simo Sorce <simo@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2013-04-29 17:19:48 -04:00
J. Bruce Fields	0ff3bab530	SUNRPC: define {create,destroy}_use_gss_proxy_proc_entry in !PROC case Though I wonder whether we should really just depend on CONFIG_PROC_FS at some point. Reported-by: kbuild test robot <fengguang.wu@intel.com>	2013-04-29 17:16:26 -04:00
J. Bruce Fields	b1df763723	Merge branch 'nfs-for-next' of git://linux-nfs.org/~trondmy/nfs-2.6 into for-3.10 Note conflict: Chuck's patches modified (and made static) gss_mech_get_by_OID, which is still needed by gss-proxy patches. The conflict resolution is a bit minimal; we may want some more cleanup.	2013-04-29 16:23:34 -04:00
David Howells	6bbefe8679	hostap: Don't use create_proc_read_entry() Don't use create_proc_read_entry() as that is deprecated, but rather use proc_create_data() and seq_file instead. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> cc: Jouni Malinen <j@w1.fi> cc: John W. Linville <linville@tuxdriver.com> cc: Johannes Berg <johannes@sipsolutions.net> cc: linux-wireless@vger.kernel.org cc: netdev@vger.kernel.org cc: devel@driverdev.osuosl.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-29 15:41:56 -04:00
Al Viro	14b872f02e	xt_hashlimit: allocate a copy of name explicitly, don't rely on procfs guts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-29 15:41:49 -04:00
Al Viro	0e2bcaae83	sock_close() couldn't have been called with NULL inode since at least 2.1.early ... if not since 0.99 or so. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-29 15:41:43 -04:00
John W. Linville	17a2911f33	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem	2013-04-29 15:31:57 -04:00
Linus Torvalds	507ffe4f38	TTY/Serial driver update for 3.10-rc1 Here's the big tty/serial driver merge request for 3.10-rc1 Once again, Jiri has a number of TTY driver fixes and cleanups, and Peter Hurley came through with a bunch of ldisc fixes that resolve a number of reported issues. There are some other serial driver cleanups as well. All of these have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlF+nSMACgkQMUfUDdst+ymy9QCfRmYn0MC0W+Q1D3Spz87gVsuo cqEAniu1BEkYZpjAz7ZlIN07Ao0jbQOR =Osu/ -----END PGP SIGNATURE----- Merge tag 'tty-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver update from Greg Kroah-Hartman: "Here's the big tty/serial driver merge request for 3.10-rc1 Once again, Jiri has a number of TTY driver fixes and cleanups, and Peter Hurley came through with a bunch of ldisc fixes that resolve a number of reported issues. There are some other serial driver cleanups as well. All of these have been in the linux-next tree for a while" * tag 'tty-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (117 commits) tty/serial/sirf: fix MODULE_DEVICE_TABLE serial: mxs: drop superfluous {get\|put}_device serial: mxs: fix buffer overflow ARM: PL011: add support for extended FIFO-size of PL011-r1p5 serial_core.c: add put_device() after device_find_child() tty: Fix unsafe bit ops in tty_throttle_safe/unthrottle_safe serial: sccnxp: Replace pdata.init/exit with regulator API serial: sccnxp: Do not override device name TTY: pty, fix compilation warning TTY: rocket, fix compilation warning TTY: ircomm: fix DTR being raised on hang up TTY: synclinkmp: fix DTR being raised on hang up TTY: synclink_gt: fix DTR being raised on hang up TTY: synclink: fix DTR being raised on hang up serial: 8250_dw: Fix the stub for dw8250_probe_acpi() serial: 8250_dw: Convert to devm_ioremap() serial: 8250_dw: Set port capabilities based on CPR register serial: 8250_dw: Let ACPI code extract the DMA client info serial: 8250_dw: Support clk framework also with ACPI serial: 8250_dw: Enable runtime PM ...	2013-04-29 12:16:17 -07:00
Yuchung Cheng	cd75eff64d	tcp: reset timer after any SYNACK retransmit Linux immediately returns SYNACK on (spurious) SYN retransmits, but keeps the SYNACK timer running independently. Thus the timer may fire right after the SYNACK retransmit and causes a SYN-SYNACK cross-fire burst. Adopt the fast retransmit/recovery idea in established state by re-arming the SYNACK timer after the fast (SYNACK) retransmit. The timer may fire late up to 500ms due to the current SYNACK timer wheel, but it's OK to be conservative when network is congested. Eric's new listener design should address this issue. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 15:14:03 -04:00
Eric Dumazet	6a5dc9e598	net: Add MIB counters for checksum errors Add MIB counters for checksum errors in IP layer, and TCP/UDP/ICMP layers, to help diagnose problems. $ nstat -a \| grep Csum IcmpInCsumErrors 72 0.0 TcpInCsumErrors 382 0.0 UdpInCsumErrors 463221 0.0 Icmp6InCsumErrors 75 0.0 Udp6InCsumErrors 173442 0.0 IpExtInCsumErrors 10884 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 15:14:03 -04:00
Eric Dumazet	aebda156a5	net: defer net_secret[] initialization Instead of feeding net_secret[] at boot time, defer the init at the point first socket is created. This permits some platforms to use better entropy sources than the ones available at boot time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 15:14:02 -04:00
John W. Linville	a8a48e60a4	With this one we have: - One patch for moving the LLCP code into net/nfc. It fixes a build annoyance reported by Dave Miller caused by the fact that the LLCP code object targets are not in the same directory as the Makefile trying to build them is. It prevents us from doing e.g. make net/nfc/llcp/sock.o Moving the LLCP code into net/nfc and not making it optional anymore makes sense as LLCP is a fundamental piece of the NFC specifications and thus should be in the core NFC directory. - One patch that fixes the missing dependency against RFKILL. Without it NFC fails to properly build when it's builtin and CONFIG_RFKILL=m. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJRewn5AAoJEIqAPN1PVmxKeDsQAJPGgwKPymgMwRceB6bwSQnx hlbHkYVxgJd7wTK3Kz5oXf1i/Krz3K39845482qetS0I02/YwmAkrMHJ0YpbAvpq +TCjKk8sYvJniq157GV05zHG5mCjMr+rh2yQ1D/2GQwiI39KHnXaJsnKV0hhlwl9 5/Z+RB1y83Qaz/ug5Y/ZlFmRxlVTIjIs/+KncC2v1KpKw050Amst+C8ScuAfpDny ywvweCDUXh4aUteN1gC9+nelwczJDq2rbeP4oSoZctsoUmcO2WLtBuVc/lKSJBSE 8Ttje4IDRcSyn1zPAAwWV8O/zx0urBhskuyv7DNIuNDRjTIK3KdpVuNwEnUG0bO1 +DJ3GfsGQLSwZCi2YQ1ABUQc2kqlUFM10UvVObRrbD2Gwq0BbpvrkKsvbJUWpqWh 9zQe8G9lLHW//MF/sAvXcR+AezU8lZ/uU/qNVse/Yz0vBFUMYXthOd67ajybaSnW cHf139buAn6kaW9LnxJs4vbHv1qKlVP6a4aAWC2elaHy8nv+UBx62vlrEJg5Jyle kO7kHzesTK7NnC5Up0xQWbxQfvuoqINNhBI6IyMqb49L27MFDci+yN3aojfu+xGt i/AH1fwltopRYd5bK+BIjpeo+G3n1FonxW/w1OOaVHr+4aEjSfAerUk59yoC+nMo kdk7eBpn8uXVbYNjc+Ij =8VFM -----END PGP SIGNATURE----- Merge tag 'nfc-next-3.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz <sameo@linux.intel.com> says: "With this one we have: - One patch for moving the LLCP code into net/nfc. It fixes a build annoyance reported by Dave Miller caused by the fact that the LLCP code object targets are not in the same directory as the Makefile trying to build them is. It prevents us from doing e.g. make net/nfc/llcp/sock.o Moving the LLCP code into net/nfc and not making it optional anymore makes sense as LLCP is a fundamental piece of the NFC specifications and thus should be in the core NFC directory. - One patch that fixes the missing dependency against RFKILL. Without it NFC fails to properly build when it's builtin and CONFIG_RFKILL=m." Signed-off-by: John W. Linville <linville@tuxdriver.com>	2013-04-29 15:08:47 -04:00
David S. Miller	14d3692f04	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following patchset contains relevant updates for the Netfilter tree, they are: * Enhancements for ipset: Add the counter extension for sets, this information can be used from the iptables set match, to change the matching behaviour. Jozsef required to add the extension infrastructure and moved the existing timeout support upon it. This also includes a change in net/sched/em_ipset to adapt it to the new extension structure. * Enhancements for performance boosting in nfnetlink_queue: Add new configuration flags that allows user-space to receive big packets (GRO) and to disable checksumming calculation. This were proposed by Eric Dumazet during the Netfilter Workshop 2013 in Copenhagen. Florian Westphal was kind enough to find the time to materialize the proposal. * A sparse fix from Simon, he noticed it in the SCTP NAT helper, the fix required a change in the interface of sctp_end_cksum. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 14:29:06 -04:00
Simon Horman	eee1d5a147	sctp: Correct type and usage of sctp_end_cksum() Change the type of the crc32 parameter of sctp_end_cksum() from __be32 to __u32 to reflect that fact that it is passed to cpu_to_le32(). There are five in-tree users of sctp_end_cksum(). The following four had warnings flagged by sparse which are no longer present with this change. net/netfilter/ipvs/ip_vs_proto_sctp.c:sctp_nat_csum() net/netfilter/ipvs/ip_vs_proto_sctp.c:sctp_csum_check() net/sctp/input.c:sctp_rcv_checksum() net/sctp/output.c:sctp_packet_transmit() The fifth user is net/netfilter/nf_nat_proto_sctp.c:sctp_manip_pkt(). It has been updated to pass a __u32 instead of a __be32, the value in question was already calculated in cpu byte-order. net/netfilter/nf_nat_proto_sctp.c:sctp_manip_pkt() has also been updated to assign the return value of sctp_end_cksum() directly to a variable of type __le32, matching the type of the return value. Previously the return value was assigned to a variable of type __be32 and then that variable was finally assigned to another variable of type __le32. Problems flagged by sparse. Compile and sparse tested only. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:08 +02:00
Florian Westphal	00bd1cc24a	netfilter: nfnetlink_queue: avoid expensive gso segmentation and checksum fixup Userspace can now indicate that it can cope with larger-than-mtu sized packets and packets that have invalid ipv4/tcp checksums. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:07 +02:00
Florian Westphal	7237190df8	netfilter: nfnetlink_queue: add skb info attribute Once we allow userspace to receive gso/gro packets, userspace needs to be able to determine when checksums appear to be broken, but are not. NFQA_SKB_CSUMNOTREADY means 'checksums will be fixed in kernel later, pretend they are ok'. NFQA_SKB_GSO could be used for statistics, or to determine when packet size exceeds mtu. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:06 +02:00
Florian Westphal	a5fedd43d5	netfilter: move skb_gso_segment into nfnetlink_queue module skb_gso_segment is expensive, so it would be nice if we could avoid it in the future. However, userspace needs to be prepared to receive larger-than-mtu-packets (which will also have incorrect l3/l4 checksums), so we cannot simply remove it. The plan is to add a per-queue feature flag that userspace can set when binding the queue. The problem is that in nf_queue, we only have a queue number, not the queue context/configuration settings. This patch should have no impact other than the skb_gso_segment call now being in a function that has access to the queue config data. A new size attribute in nf_queue_entry is needed so nfnetlink_queue can duplicate the entry of the gso skb when segmenting the skb while also copying the route key. The follow up patch adds switch to disable skb_gso_segment when queue config says so. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:05 +02:00
Florian Westphal	4bd60443cc	netfilter: nf_queue: move device refcount bump to extra function required by future patch that will need to duplicate the nf_queue_entry, bumping refcounts of the copy. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:04 +02:00
Jozsef Kadlecsik	6e01781d1c	netfilter: ipset: set match: add support to match the counters The new revision of the set match supports to match the counters and to suppress updating the counters at matching too. At the set:list types, the updating of the subcounters can be suppressed as well. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:03 +02:00
Jozsef Kadlecsik	de76303c5a	netfilter: ipset: The list:set type with counter support Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:02 +02:00
Jozsef Kadlecsik	00d71b270e	netfilter: ipset: The hash types with counter support Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:01 +02:00
Jozsef Kadlecsik	f48d19db12	netfilter: ipset: The bitmap types with counter support Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:09:00 +02:00
Jozsef Kadlecsik	34d666d489	netfilter: ipset: Introduce the counter extension in the core Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:59 +02:00
Jozsef Kadlecsik	7d47d972b5	netfilter: ipset: list:set type using the extension interface Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:58 +02:00
Jozsef Kadlecsik	5d50e1d883	netfilter: ipset: Hash types using the unified code base Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:57 +02:00
Jozsef Kadlecsik	1feab10d7e	netfilter: ipset: Unified hash type generation Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:56 +02:00
Jozsef Kadlecsik	b0da3905bb	netfilter: ipset: Bitmap types using the unified code base Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:55 +02:00
Jozsef Kadlecsik	4d73de38c2	netfilter: ipset: Unified bitmap type generation Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:54 +02:00
Jozsef Kadlecsik	075e64c041	netfilter: ipset: Introduce extensions to elements in the core Introduce extensions to elements in the core and prepare timeout as the first one. This patch also modifies the em_ipset classifier to use the new extension struct layout. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:54 +02:00
Jozsef Kadlecsik	8672d4d1a0	netfilter: ipset: Move often used IPv6 address masking function to header file Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:50 +02:00
Jozsef Kadlecsik	43c56e595b	netfilter: ipset: Make possible to test elements marked with nomatch Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-29 20:08:44 +02:00
Pravin B Shelar	5f5624cf15	ipv6: Kill ipv6 dependency of icmpv6_send(). Following patch adds icmp-registration module for ipv6. It allows ipv6 protocol to register icmp_sender which is used for sending ipv6 icmp msgs. This extra layer allows us to kill ipv6 dependency for sending icmp packets. This patch also fixes ip_tunnel compilation problem when ip_tunnel is statically compiled in kernel but ipv6 is module Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 13:54:36 -04:00
Nicolas Dichtel	e8d9612c18	sock_diag: allow to dump bpf filters This patch allows to dump BPF filters attached to a socket with SO_ATTACH_FILTER. Note that we check CAP_SYS_ADMIN before allowing to dump this info. For now, only AF_PACKET sockets use this feature. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 13:21:30 -04:00
Nicolas Dichtel	76d0eeb1a1	packet_diag: disclose meminfo values sk_rmem_alloc is disclosed via /proc/net/packet but not via netlink messages. The goal is to have the same level of information. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 13:21:30 -04:00
Nicolas Dichtel	626419038a	packet_diag: disclose uid value This value is disclosed via /proc/net/packet but not via netlink messages. The goal is to have the same level of information. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 13:21:30 -04:00
Chen Gang	2c1bbbffa0	net: mac802154: comparision issue of type cast, finding by EXTRA_CFLAGS=-W Change MAC802154_CHAN_NONE from ~(u8)0 to 0xff, or the comparison in mac802154_wpan_xmit() for ``chan == MAC802154_CHAN_NONE'' will not succeed. This bug can be boiled down to ``u8 foo = 0xff; if (foo == ~(u8)0) [...] else [...]'' where the condition will always take the else branch. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 12:29:04 -04:00
roopa	b0a397fb35	bridge: Add fdb dst check during fdb update Current bridge fdb update code does not seem to update the port during fdb update. This patch adds a check for fdb dst (port) change during fdb update. Also rearranges the call to fdb_notify to send only one notification for create and update. Changelog: v2 - Change notify flag to bool Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 11:40:26 -04:00
Hans Schillstrom	f7a1dd6e3a	ipvs: ip_vs_sip_fill_param() BUG: bad check of return value The reason for this patch is crash in kmemdup caused by returning from get_callid with uniialized matchoff and matchlen. Removing Zero check of matchlen since it's done by ct_sip_get_header() BUG: unable to handle kernel paging request at ffff880457b5763f IP: [<ffffffff810df7fc>] kmemdup+0x2e/0x35 PGD 27f6067 PUD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: xt_state xt_helper nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle xt_connmark xt_conntrack ip6_tables nf_conntrack_ftp ip_vs_ftp nf_nat xt_tcpudp iptable_mangle xt_mark ip_tables x_tables ip_vs_rr ip_vs_lblcr ip_vs_pe_sip ip_vs nf_conntrack_sip nf_conntrack bonding igb i2c_algo_bit i2c_core CPU 5 Pid: 0, comm: swapper/5 Not tainted 3.9.0-rc5+ #5 /S1200KP RIP: 0010:[<ffffffff810df7fc>] [<ffffffff810df7fc>] kmemdup+0x2e/0x35 RSP: 0018:ffff8803fea03648 EFLAGS: 00010282 RAX: ffff8803d61063e0 RBX: 0000000000000003 RCX: 0000000000000003 RDX: 0000000000000003 RSI: ffff880457b5763f RDI: ffff8803d61063e0 RBP: ffff8803fea03658 R08: 0000000000000008 R09: 0000000000000011 R10: 0000000000000011 R11: 00ffffffff81a8a3 R12: ffff880457b5763f R13: ffff8803d67f786a R14: ffff8803fea03730 R15: ffffffffa0098e90 FS: 0000000000000000(0000) GS:ffff8803fea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff880457b5763f CR3: 0000000001a0c000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/5 (pid: 0, threadinfo ffff8803ee18c000, task ffff8803ee18a480) Stack: ffff8803d822a080 000000000000001c ffff8803fea036c8 ffffffffa000937a ffffffff81f0d8a0 000000038135fdd5 ffff880300000014 ffff880300110000 ffffffff150118ac ffff8803d7e8a000 ffff88031e0118ac 0000000000000000 Call Trace: <IRQ> [<ffffffffa000937a>] ip_vs_sip_fill_param+0x13a/0x187 [ip_vs_pe_sip] [<ffffffffa007b209>] ip_vs_sched_persist+0x2c6/0x9c3 [ip_vs] [<ffffffff8107dc53>] ? __lock_acquire+0x677/0x1697 [<ffffffff8100972e>] ? native_sched_clock+0x3c/0x7d [<ffffffff8100972e>] ? native_sched_clock+0x3c/0x7d [<ffffffff810649bc>] ? sched_clock_cpu+0x43/0xcf [<ffffffffa007bb1e>] ip_vs_schedule+0x181/0x4ba [ip_vs] ... Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 11:35:30 -04:00
Wei Yongjun	50754d2188	genetlink: fix possible memory leak in genl_family_rcv_msg() 'attrbuf' is malloced in genl_family_rcv_msg() when family->maxattr && family->parallel_ops, thus should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Introduced by commit `def3117493` (genl: Allow concurrent genl callbacks.) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-26 23:25:39 -04:00
Marcel Holtmann	c204ea092e	NFC: Add missing RFKILL dependency for Kconfig Since the NFC subsystem gained RFKILL support, it needs to be able to build properly with whatever option for RFKILL has been selected. on i386: net/built-in.o: In function `nfc_unregister_device': (.text+0x6a36d): undefined reference to `rfkill_unregister' net/built-in.o: In function `nfc_unregister_device': (.text+0x6a378): undefined reference to `rfkill_destroy' net/built-in.o: In function `nfc_register_device': (.text+0x6a493): undefined reference to `rfkill_alloc' net/built-in.o: In function `nfc_register_device': (.text+0x6a4a4): undefined reference to `rfkill_register' net/built-in.o: In function `nfc_register_device': (.text+0x6a4b3): undefined reference to `rfkill_destroy' net/built-in.o: In function `nfc_dev_up': (.text+0x6a8e8): undefined reference to `rfkill_blocked' when CONFIG_RFKILL=m but NFC is builtin. Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>	2013-04-27 01:02:46 +02:00
Simo Sorce	030d794bf4	SUNRPC: Use gssproxy upcall for server RPCGSS authentication. The main advantge of this new upcall mechanism is that it can handle big tickets as seen in Kerberos implementations where tickets carry authorization data like the MS-PAC buffer with AD or the Posix Authorization Data being discussed in IETF on the krbwg working group. The Gssproxy program is used to perform the accept_sec_context call on the kernel's behalf. The code is changed to also pass the input buffer straight to upcall mechanism to avoid allocating and copying many pages as tokens can be as big (potentially more in future) as 64KiB. Signed-off-by: Simo Sorce <simo@redhat.com> [bfields: containerization, negotiation api] Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-26 11:41:28 -04:00
Simo Sorce	1d658336b0	SUNRPC: Add RPC based upcall mechanism for RPCGSS auth This patch implements a sunrpc client to use the services of the gssproxy userspace daemon. In particular it allows to perform calls in user space using an RPC call instead of custom hand-coded upcall/downcall messages. Currently only accept_sec_context is implemented as that is all is needed for the server case. File server modules like NFS and CIFS can use full gssapi services this way, once init_sec_context is also implemented. For the NFS server case this code allow to lift the limit of max 2k krb5 tickets. This limit is prevents legitimate kerberos deployments from using krb5 authentication with the Linux NFS server as they have normally ticket that are many kilobytes large. It will also allow to lift the limitation on the size of the credential set (uid,gid,gids) passed down from user space for users that have very many groups associated. Currently the downcall mechanism used by rpc.svcgssd is limited to around 2k secondary groups of the 65k allowed by kernel structures. Signed-off-by: Simo Sorce <simo@redhat.com> [bfields: containerization, concurrent upcalls, misc. fixes and cleanup] Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-26 11:41:27 -04:00
Simo Sorce	400f26b542	SUNRPC: conditionally return endtime from import_sec_context We expose this parameter for a future caller. It will be used to extract the endtime from the gss-proxy upcall mechanism, in order to set the rsc cache expiration time. Signed-off-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-26 11:41:27 -04:00
J. Bruce Fields	33d90ac058	SUNRPC: allow disabling idle timeout In the gss-proxy case we don't want to have to reconnect at random--we want to connect only on gss-proxy startup when we can steal gss-proxy's context to do the connect in the right namespace. So, provide a flag that allows the rpc_create caller to turn off the idle timeout. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-26 11:41:26 -04:00
J. Bruce Fields	7073ea8799	SUNRPC: attempt AF_LOCAL connect on setup In the gss-proxy case, setup time is when I know I'll have the right namespace for the connect. In other cases, it might be useful to get any connection errors earlier--though actually in practice it doesn't make any difference for rpcbind. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2013-04-26 11:41:25 -04:00
J. Bruce Fields	c85b03ab20	Merge Trond's nfs-for-next Merging Trond's nfs-for-next branch, mainly to get `b7993cebb8` "SUNRPC: Allow rpc_create() to request that TCP slots be unlimited", which a small piece of the gss-proxy work depends on.	2013-04-26 11:37:43 -04:00
John W. Linville	375e875c69	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next	2013-04-26 08:36:00 -04:00

... 2 3 4 5 6 ...

28303 Commits