All calls of ceph_osdc_start_request() are preceded (in the case of
rbd, almost) immediately by a call to ceph_osdc_build_request().
Move the build calls at the top of ceph_osdc_start_request() out of
there and into the ceph_osdc_build_request(). Nothing prevents
moving these calls to the top of ceph_osdc_build_request(), either
(and we're going to want them there in the next patch) so put them
at the top.
This and the next patch are related to:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This simply moves ceph_osdc_build_request() later in its source
file without any change. Done as a separate patch to facilitate
review of the change in the next patch.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An object class method is formatted using a pagelist which contains
the class name, the method name, and the data concatenated into an
osd request's outbound data.
Currently when a class op is initialized in osd_req_op_cls_init(),
the lengths of and pointers to these three items are recorded.
Later, when the op is getting formatted into the request message, a
new pagelist is created and that is when these items get copied into
the pagelist.
This patch makes it so the pagelist to hold these items is created
when the op is initialized instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.
So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize. This better hides the details the
op structure (and faciltates moving the data pointers they use).
Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope. Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message. Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.
Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message. The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.
Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient. Add some assertions to verify
pointers are always set the way they're expected to be.
This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.
This is the first in a series of patches that resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are fields "indata" and "indata_len" defined the ceph osd
request op structure. The "in" part is with from the point of view
of the osd server, but is a little confusing here on the client
side. Change their names to use "request" instead of "in" to
indicate that it defines data provided with the request (as opposed
the data returned in the response).
Rename the local variable in osd_req_encode_op() to match.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.
In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.
As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops. And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.
This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used. The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.
The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.
Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.
It does not yet add per-op data, that's coming soon.
This resolves:
http://tracker.ceph.com/issues/4656
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
One more osd data helper, which returns the length of the
data item, regardless of its type.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define ceph_osd_data_init() and ceph_osd_data_release() to clean up
a little code.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is a simple change, extracting the number of incoming data
bytes just once in handle_reply().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In prepare_message_data(), the length used to initialize the cursor
is taken from the header of the message provided. I'm working
toward not using the header data length field to determine length in
outbound messages, and this is a step in that direction. For
inbound messages this will be set to be the actual number of bytes
that are arriving (which may be less than the total size of the data
buffer available).
This resolves:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.
We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write. A local variable
will be used to keep track of the allocated array of pages. Wait
until just before submitting the request for assigning that page
array pointer to the request message.
Create ands use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.
Previously, ceph_writepages_start() assigned the message header's
data length because of this update. That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Defer building the osd request until just before submitting it in
all callers except ceph_writepages_start(). (That caller will be
handed in the next patch.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch moves the call to ceph_osdc_build_request() out of
ceph_osdc_new_request() and into its caller.
This is in order to defer formatting osd operation information into
the request message until just before request is started.
The only unusual (ab)user of ceph_osdc_build_request() is
ceph_writepages_start(), where the final length of write request may
change (downward) based on the current inode size or the oldest
snapshot context with dirty data for the inode.
The remaining callers don't change anything in the request after has
been built.
This means the ops array is now supplied by the caller. It also
means there is no need to pass the mtime to ceph_osdc_new_request()
(it gets provided to ceph_osdc_build_request()). And rather than
passing a do_sync flag, have the number of ops in the ops array
supplied imply adding a second STARTSYNC operation after the READ or
WRITE requested.
This and some of the patches that follow are related to having the
messenger (only) be responsible for filling the content of the
message header, as described here:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure. This information has
been maintained in wire byte order in the message header, but
that's going to change soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A field in an osd request keeps track of whether a connection is
currently filling the request's reply message. This patch gets rid
of that field.
An osd request includes two messages--a request and a reply--and
they're both associated with the connection that existed to its
the target osd at the time the request was created.
An osd request can be dropped early, even when it's in flight.
And at that time both messages are released. It's possible the
reply message has been supplied to its connection to receive
an incoming response message at the time the osd request gets
dropped. So ceph_osdc_release_request() revokes that message
from the connection before releasing it so things get cleaned up
properly.
Previously this may have caused a problem, because the connection
that a message was associated with might have gone away before the
revoke request. And to avoid any problems using that connection,
the osd client held a reference to it when it supplies its response
message.
However since this commit:
38941f80 libceph: have messages point to their connection
all messages hold a reference to the connection they are associated
with whenever the connection is actively operating on the message
(i.e. while the message is queued to send or sending, and when it
data is being received into it). And if a message has no connection
associated with it, ceph_msg_revoke_incoming() won't do anything
when asked to revoke it.
As a result, there is no need to keep an additional reference to the
connection associated with a message when we hand the message to the
messenger when it calls our alloc_msg() method to receive something.
If the connection *were* operating on it, it would have its own
reference, and if not, there's no work to be done when we need to
revoke it.
So get rid of the osd request's r_con_filling_msg field.
This resolves:
http://tracker.ceph.com/issues/4647
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are two basically identical definitions of __decode_pgid()
in libceph, one in "net/ceph/osdmap.c" and the other in
"net/ceph/osd_client.c". Get rid of both, and instead define
a single inline version in "include/linux/ceph/osdmap.h".
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The osd client mutex is acquired just before getting a reference to
a request in handle_reply(). However the error paths after that
don't drop the mutex before returning as they should.
Drop the mutex after dropping the request reference. Also add a
bad_mutex label at that point and use it so the failed request
lookup case can be handled with the rest.
This resolves:
http://tracker.ceph.com/issues/4615
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Use osd_req_op_extent_init() in ceph_osdc_new_request() to
initialize the one or two ops built in that function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All callers of ceph_osd_new_request() pass either CEPH_OSD_OP_READ
or CEPH_OSD_OP_WRITE as the opcode value. The function assumes it
by filling in the extent fields in the ops array it builds. So just
assert that is the case, and don't bother calling op_has_extent()
before filling in the first osd operation in the array.
Define some local variables to gather the information to fill into
the first op, and then fill in the op array all in one place.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The ceph_osdc_new_request() an array of osd operations is built up
and filled in partially within that function and partially in the
called function calc_layout(). Move the latter part back out to
ceph_osdc_new_request() so it's all done in one place. This makes
it unnecessary to pass the op pointer to calc_layout(), so get rid
of that parameter.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The purpose of calc_layout() is to determine, given a file offset
and length and a layout describing the placement of file data across
objects, where in "object space" that data resides.
Specifically, it determines which object should hold the first part
of the specified range of file data, and the offset and length of
data within that object. The length will not exceed the bounds
of the object, and the caller is informed of that maximum length.
Add two parameters to calc_layout() to allow the object-relative
offset and length to be passed back to the caller.
This is the first steps toward having ceph_osdc_new_request() build
its osd op structure using osd_req_op_extent_init().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd code has a function that allocates and populates a
ceph_osd_req_op structure (the in-core version of an osd request
operation). When reviewed, Josh suggested two things: that the
big varargs function might be better split into type-specific
functions; and that this functionality really belongs in the osd
client rather than rbd.
This patch implements both of Josh's suggestions. It breaks
up the rbd function into separate functions and defines them
in the osd client module as exported interfaces. Unlike the
rbd version, however, the functions don't allocate an osd_req_op
structure; they are provided the address of one and that is
initialized instead.
The rbd function has been eliminated and calls to it have been
replaced by calls to the new routines. The rbd code now now use a
stack (struct) variable to hold the op rather than allocating and
freeing it each time.
For now only the capabilities used by rbd are implemented.
Implementing all the other osd op types, and making the rest of the
code use it will be done separately, in the next few patches.
Note that only the extent, cls, and watch portions of the
ceph_osd_req_op structure are currently used. Delete the others
(xattr, pgls, and snap) from its definition so nobody thinks it's
actually implemented or needed. We can add it back again later
if needed, when we know it's been tested.
This (and a few follow-on patches) resolves:
http://tracker.ceph.com/issues/3861
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a separate function to determine the validity of an opcode,
and use it inside osd_req_encode_op() in order to unclutter that
function.
Don't update the destination op at all--and return zero--if an
unsupported or unrecognized opcode is seen in osd_req_encode_op().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In ceph_osdc_build_request() there is a call to cpu_to_le16() which
provides a 64-bit value as its argument. Because of the implied
byte swapping going on it looked pretty suspect to me.
At the moment it turns out the behavior is well defined, but masking
off those bottom bits explicitly eliminates this distraction, and is
in fact more directly related to the purpose of the message header's
data_off field.
This resolves:
http://tracker.ceph.com/issues/4125
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When a cursor for a page array data message is initialized it needs
to determine the initial value for cursor->last_piece. Currently it
just checks if length is less than a page, but that's not correct.
The data in the first page in the array will be offset by a page
offset based on the alignment recorded for the data. (All pages
thereafter will be aligned at the base of the page, so there's
no need to account for this except for the first page.)
Because this was wrong, there was a case where the length of a piece
would be calculated as all of the residual bytes in the message and
that plus the page offset could exceed the length of a page.
So fix this case. Make sure the sum won't wrap.
This resolves a third issue described in:
http://tracker.ceph.com/issues/4598
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Currently ceph_msg_data_pages_advance() allows the page offset value
to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will
treat it as 0. But that doesn't happen, and the result led to a
helpful assertion failure.
Change ceph_msg_data_pages_advance() to truncate the offset to 0
before returning if it reaches PAGE_SIZE.
Make a few other minor adjustments in this area (comments and a
better assertion) while modifying it.
This resolves a second issue described in:
http://tracker.ceph.com/issues/4598
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
It's OK for the result of a read to come back with fewer bytes than
were requested. So don't trigger a BUG() in that case when
initializing the data cursor.
This resolves the first problem described in:
http://tracker.ceph.com/issues/4598
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.
A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.
Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.
When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.
This partially resolves:
http://tracker.ceph.com/issues/4429
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The *_msg_pos_next() functions do little more than call
ceph_msg_data_advance(). Replace those wrapper functions with
a simple call to ceph_msg_data_advance().
This cleanup is related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered. Because it was computed *before* sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result. This would occasionally happen in
the evnet of a connection failure.
The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.
So don't calculate the crc for a piece until *after* we know at
least one byte of it has been sent. That will avoid this problem.
This resolves:
http://tracker.ceph.com/issues/4450
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The only remaining field in the ceph_msg_pos structure is
did_page_crc. In the new cursor model of things that flag (or
something like it) belongs in the cursor.
Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.
In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed. Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.
In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times. In read_partial_msg_data() we do something similar
(but mainly for consistency there).
With that, the ceph_msg_pos structure can go away, and it no longer
needs to be passed as an argument to prepare_message_data().
This cleanup is related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them. This allows
several small blocks of code to go away.
This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use the "resid" field of a cursor rather than finding when the
message data position has moved up to meet the data length to
determine when all data has been sent or received in
write_partial_message_data() and read_partial_msg_data().
This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
- A page array is used by the osd client (on behalf of the file
system) and by rbd. Only one osd op (and therefore at most
one data item) is ever used at a time by rbd. And the only
time the file system sends two, the second op contains no
data.
- A bio is only used by the rbd client (and again, only one
data item per message)
- A page list is used by the file system and by rbd for outgoing
data, but only one op (and one data item) at a time.
We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.
This allows us to eliminate quite a bit of duplicated code.
This is related to:
http://tracker.ceph.com/issues/4429
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that read_partial_message_pages() and read_partial_message_bio()
are literally identical functions we can factor them out. They're
pretty simple as well, so just move their relevant content into
read_partial_msg_data().
This is and previous patches together resolve:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is handling in write_partial_message_data() for the case where
only the length of--and no other information about--the data to be
sent has been specified. It uses the zero page as the source of
data to send in this case.
This case doesn't occur. All message senders set up a page array,
pagelist, or bio describing the data to be sent. So eliminate the
block of code that handles this (but check and issue a warning for
now, just in case it happens for some reason).
This resolves:
http://tracker.ceph.com/issues/4426
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The cursor code for a page array selects the right page, page
offset, and length to use for a ceph_tcp_recvpage() call, so
we can use it to replace a block in read_partial_message_pages().
This partially resolves:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_iter and bio_seg fields in a message are no longer used, we
use the cursor instead. So get rid of them and the functions that
operate on them them.
This is related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Replace the use of the information in con->in_msg_pos for incoming
bio data. The old in_msg_pos and the new cursor mechanism do
basically the same thing, just slightly differently.
The main functional difference is that in_msg_pos keeps track of the
length of the complete bio list, and assumed it was fully consumed
when that many bytes had been transferred. The cursor does not assume
a length, it simply consumes all bytes in the bio list. Because the
only user of bio data is the rbd client, and because the length of a
bio list provided by rbd client always matches the number of bytes
in the list, both ways of tracking length are equivalent.
In addition, for in_msg_pos the initial bio vector is selected as
the initial value of the bio->bi_idx, while the cursor assumes this
is zero. Again, the rbd client always passes 0 as the initial index
so the effect is the same.
Other than that, they basically match:
in_msg_pos cursor
---------- ------
bio_iter bio
bio_seg vec_index
page_pos page_offset
The in_msg_pos field is initialized by a call to init_bio_iter().
The bio cursor is initialized by ceph_msg_data_cursor_init().
Both now happen in the same spot, in prepare_message_data().
The in_msg_pos field is advanced by a call to in_msg_pos_next(),
which updates page_pos and calls iter_bio_next() to move to the next
bio vector, or to the next bio in the list. The cursor is advanced
by ceph_msg_data_advance(). That isn't currently happening so
add a call to that in in_msg_pos_next().
Finally, the next piece of data to use for a read is determined
by a bunch of lines in read_partial_message_bio(). Those can be
replaced by an equivalent ceph_msg_data_bio_next() call.
This partially resolves:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All of the data types can use this, not just the page array. Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list. Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to intiialize
the "resid" field of the cursor.
In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request. This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.
This resolves:
http://tracker.ceph.com/issues/4427
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The value passed for "pages" in read_partial_message_pages() is
always the pages pointer from the incoming message, which can be
derived inside that function. So just get rid of the parameter.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When the last reference to a ceph message is dropped,
ceph_msg_last_put() is called to clean things up.
For "normal" messages (allocated via ceph_msg_new() rather than
being allocated from a memory pool) it's sufficient to just release
resources. But for a mempool-allocated message we actually have to
re-initialize the data fields in the message back to initial state
so they're ready to go in the event the message gets reused.
Some of this was already done; this fleshes it out so it's done
more completely.
This resolves:
http://tracker.ceph.com/issues/4540
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd expects the transaction ids of arriving request messages from
a given client to a given osd to increase monotonically. So the osd
client needs to send its requests in ascending tid order.
The transaction id for a request is set at the time it is
registered, in __register_request(). This is also where the request
gets placed at the end of the osd client's unsent messages list.
At the end of ceph_osdc_start_request(), the request message for a
newly-mapped osd request is supplied to the messenger to be sent
(via __send_request()). If any other messages were present in the
osd client's unsent list at that point they would be sent *after*
this new request message.
Because those unsent messages have already been registered, their
tids would be lower than the newly-mapped request message, and
sending that message first can violate the tid ordering rule.
Rather than sending the new request only, send all queued requests
(including the new one) at that point in ceph_osdc_start_request().
This ensures the tid ordering property is preserved.
With this in place, all messages should now be sent in tid order
regardless of whether they're being sent for the first time or
re-sent as a result of a call to osd_reset().
This resolves:
http://tracker.ceph.com/issues/4392
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
In __map_request(), when adding a request to an osd client's unsent
list, add it to the tail rather than the head. That way the newest
entries (with the highest tid value) will be last.
Maintain an osd's request list in order of increasing tid also.
Finally--to be consistent--maintain an osd client's "notarget" list
in that order as well.
This partially resolves:
http://tracker.ceph.com/issues/4392
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
The osd expects incoming requests for a given object from a given
client to arrive in order, with the tid for each request being
greater than the tid for requests that have already arrived. This
patch fixes two places the osd client might not maintain that
ordering.
For the osd client, the connection fault method is osd_reset().
That function calls __reset_osd() to close and re-open the
connection, then calls __kick_osd_requests() to cause all
outstanding requests for the affected osd to be re-sent after
the connection has been re-established.
When an osd is reset, any in-flight messages will need to be
re-sent. An osd client maintains distinct lists for unsent and
in-flight messages. Meanwhile, an osd maintains a single list of
all its requests (both sent and un-sent). (Each message is linked
into two lists--one for the osd client and one list for the osd.)
To process an osd "kick" operation, the request list for the *osd*
is traversed, and each request is moved off whichever osd *client*
list it was on (unsent or sent) and placed onto the osd client's
unsent list. (It remains where it is on the osd's request list.)
When that is done, osd_reset() calls __send_queued() to cause each
of the osd client's unsent messages to be sent.
OK, with that background...
As the osd request list is traversed each request is prepended to
the osd client's unsent list in the order they're seen. The effect
of this is to reverse the order of these requests as they are put
(back) onto the unsent list.
Instead, build up a list of only the requests for an osd that have
already been sent (by checking their r_sent flag values). Once an
unsent request is found, stop examining requests and prepend the
requests that need re-sending to the osd client's unsent list.
Preserve the original order of requests in the process (previously
re-queued requests were reversed in this process). Because they
have already been sent, they will have lower tids than any request
already present on the unsent list.
Just below that, traverse the linger list in forward order as
before, but add them to the *tail* of the list rather than the head.
These requests get re-registered, and in the process are give a new
(higher) tid, so the should go at the end.
This partially resolves:
http://tracker.ceph.com/issues/4392
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
Since we no longer drop the request mutex between registering and
mapping an osd request in ceph_osdc_start_request(), there is no
chance of a race with kick_requests().
We can now therefore map and send the new request unconditionally
(but we'll issue a warning should it ever occur).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
One of the first things ceph_osdc_start_request() does is register
the request. It then acquires the osd client's map semaphore and
request mutex and proceeds to map and send the request.
There is no reason the request has to be registered before acquiring
the map semaphore. So hold off doing so until after the map
semaphore is held.
Since register_request() is nothing more than a wrapper around
__register_request(), call the latter function instead, after
acquiring the request mutex.
That leaves register_request() unused, so get rid of it.
This partially resolves:
http://tracker.ceph.com/issues/4392
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
The auth code is called from a variety of contexts, include the mon_client
(protected by the monc's mutex) and the messenger callbacks (currently
protected by nothing). Avoid chaos by protecting all auth state with a
mutex. Nothing is blocking, so this should be simple and lightweight.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks. Simplifies the external
interface.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one. In the meantime, when
we rotate our service keys, the authorizer doesn't get updated. Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.
Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret. If it
is not, we will build a new one that does. This avoids the transient
failure.
This fixes one of the sorry sequence of events for bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
We were invalidating the authorizer by removing the ticket handler
entirely. This was effective in inducing us to request a new authorizer,
but in the meantime it mean that any authorizer we generated would get a
new and initialized handler with secret_id=0, which would always be
rejected by the server side with a confusing error message:
auth: could not find secret_id=0
cephx: verify_authorizer could not get service secret for service osd secret_id=0
Instead, simply clear the validity field. This will still induce the auth
code to request a new secret, but will let us continue to use the old
ticket in the meantime. The messenger code will probably continue to fail,
but the exponential backoff will kick in, and eventually the we will get a
new (hopefully more valid) ticket from the mon and be able to continue.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
We maintain a counter of failed auth attempts to allow us to retry once
before failing. However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).
This is one part of the sorry sequence of events in bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake. This
avoids sending a bunch of useless data over the socket.
It has been supported in the server code since v0.22 (Sep 2010).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Basically all cases in write_partial_msg_pages() use the cursor, and
as a result we can simplify that function quite a bit.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.
The result allows write_partial_msg_pages() to be simplified
considerably.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The osd trail is a pagelist, used only for a CALL osd operation
to hold the class and method names, along with any input data for
the call.
It is only currently used by the rbd client, and when it's used it
is the only bit of outbound data in the osd request. Since we
already support (non-trail) pagelist data in a message, we can
just save this outbound CALL data in the "normal" pagelist rather
than the trail, and get rid of the trail entirely.
The existing pagelist support depends on the pagelist being
dynamically allocated, and ownership of it is passed to the
messenger once it's been attached to a message. (That is to say,
the messenger releases and frees the pagelist when it's done with
it). That means we need to dynamically allocate the pagelist also.
Note that we simply assert that the allocation of a pagelist
structure succeeds. Appending to a pagelist might require a dynamic
allocation, so we're already assuming we won't run into trouble
doing so (we're just ignore any failures--and that should be fixed
at some point).
This resolves:
http://tracker.ceph.com/issues/4407
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add support for recording a ceph pagelist as data associated with an
osd request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The length of outgoing data in an osd request is dependent on the
osd ops that are embedded in that request. Each op is encoded into
a request message using osd_req_encode_op(), so that should be used
to determine the amount of outgoing data implied by the op as it
is encoded.
Have osd_req_encode_op() return the number of bytes of outgoing data
implied by the op being encoded, and accumulate and use that in
ceph_osdc_build_request().
As a result, ceph_osdc_build_request() no longer requires its "len"
parameter, so get rid of it.
Using the sum of the op lengths rather than the length provided is
a valid change because:
- The only callers of osd ceph_osdc_build_request() are
rbd and the osd client (in ceph_osdc_new_request() on
behalf of the file system).
- When rbd calls it, the length provided is only non-zero for
write requests, and in that case the single op has the
same length value as what was passed here.
- When called from ceph_osdc_new_request(), (it's not all that
easy to see, but) the length passed is also always the same
as the extent length encoded in its (single) write op if
present.
This resolves:
http://tracker.ceph.com/issues/4406
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement and use cursor routines for page array message data items
for outbound message data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement and use cursor routines for bio message data items for
outbound message data.
(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Switch to using the message cursor for the (non-trail) outgoing
pagelist data item in a message if present.
Notes on the logic changes in out_msg_pos_next():
- only the mds client uses a ceph pagelist for message data;
- if the mds client ever uses a pagelist, it never uses a page
array (or anything else, for that matter) for data in the same
message;
- only the osd client uses the trail portion of a message data,
and when it does, it never uses any other data fields for
outgoing data in the same message; and finally
- only the rbd client uses bio message data (never pagelist).
Therefore out_msg_pos_next() can assume:
- if we're in the trail portion of a message, the message data
pagelist, data, and bio can be ignored; and
- if there is a page list, there will never be any a bio or page
array data, and vice-versa.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This just inserts some infrastructure in preparation for handling
other types of ceph message data items. No functional changes,
just trying to simplify review by separating out some noise.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch lays out the foundation for using generic routines to
manage processing items of message data.
For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.
First some basic concepts. We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message. There are currently four of those, with single-letter
field names p, l, b, and t. A data item is further broken into
"pieces" which always lie in a single page. A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.
We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine. The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item. The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item. It also indicates
to the caller whether the piece being returned is the last one in
the data item.
The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor). This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code. It returns an indication of whether the result
means the current piece has been fully consumed. This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.
The trail of a message is implemented as a ceph pagelist. The
routines defined for it will be usable for non-trail pagelist data
as well.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents. Use this to represent the pages,
pagelist, bio, and trail in a ceph message.
Verify message data is of type NONE in ceph_msg_data_set_*()
routines. Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A ceph message has a data payload portion. The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways. The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.
Specifically, the data in a message can be in:
- an array of pages
- a list of pages
- a list of Linux bios
- a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)
Impose more structure on the ceph message, making the grouping of
some of these fields explicit. Shorten the name of the
"page_alignment" field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Factor out a common block of code that updates a CRC calculation
over a range of data in a page.
This and the preceding patches are related to:
http://tracker.ceph.com/issues/4403
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().
Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the code that reads the data portion into a message into
a separate function read_partial_msg_data().
Rename write_partial_msg_pages() to be write_partial_message_data()
to match its read counterpart, and to reflect its more generic
purpose.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out. Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.
Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots. The
did_page_crc is not used for input (so it's harmless to initialize
it).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function. Use a local pointer
variable for this purpose to unclutter the code.
This and the upcoming cleanup patches are related to:
http://tracker.ceph.com/issues/4403
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().
That no longer seems to be the case. Here's why.
The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.
In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.
Because it gets overwritten anyway, there is no need to set it to
NULL. So don't do it.
This resolves:
http://tracker.ceph.com/issues/4402
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.
This resolves both of these:
http://tracker.ceph.com/issues/4263http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
When an incoming message is destined for the osd client, the
messenger calls the osd client's alloc_msg method. That function
looks up which request has the tid matching the incoming message,
and returns the request message that was preallocated to receive the
response. The response message is therefore known before the
request is even started.
Between the start of the request and the receipt of the response,
the request and its data fields will not change, so there's no
reason we need to hold off setting them. In fact it's preferable
to set them just once because it's more obvious that they're
unchanging.
So set up the fields describing where incoming data is to land in a
response message at the beginning of ceph_osdc_start_request().
Define a helper function that sets these fields, and use it to
set the fields for both outgoing data in the request message and
incoming data in the response.
This resolves:
http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the number of bytes of data in a page array rather than the
number of pages in the array. It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change it so we only assign outgoing data information for messages
if there is outgoing data to send.
This then allows us to add a few more (currently commented-out)
assertions.
This is related to:
http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.
Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)
Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.
There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.
So replace these two list_move_tail() calls with list_rotate_left().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().
Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.
Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.
If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.
In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.
So drop those two parameters and use their derived equivalents.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change the type of the "more" parameter from int to bool.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Some values printed are not (necessarily) in CPU order. We already
have a copy of the converted versions, so use them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().
The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.
The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.
Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.
This resolves:
http://tracker.ceph.com/issues/4324
Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.
This and the next patch resolve:
http://tracker.ceph.com/issues/4322
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().
Drop the mutex before making this call, and make sure we're still
connected when we get it back again.
This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The pagelist_count field is never actually used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The new cases added to osd_req_encode_op() caused a new sparse
error, which highlighted an existing problem that had been
overlooked since it was originally checked in. When an unsupported
opcode is found the destination rather than the source opcode was
being used in the error message. The two differ in their byte
order, and we want to be using the one in the source.
Fix the problem in both spots.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped. Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.
Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.
This resolves:
http://tracker.ceph.com/issues/3967
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The page alignment field for a request is currently set in
ceph_osdc_build_request(). It's not needed at that point
nor do either of its callers need that value assigned at
any point before they call ceph_osdc_start_request().
So move that assignment into ceph_osdc_start_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only remaining reason to pass the osd request to calc_layout()
is to fill in its r_num_pages and r_page_alignment fields. Once it
fills those in, it doesn't do anything more with them.
We can therefore move those assignments into the caller, and get rid
of the "req" parameter entirely.
Note, however, that the only caller is ceph_osdc_new_request(),
and that immediately overwrites those fields with values based on
its passed-in page offset. So the assignment inside calc_layout()
was redundant anyway.
This resolves:
http://tracker.ceph.com/issues/4262
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the formatting of the object name (oid) to use for an object
request into the caller of calc_layout(). This makes the "vino"
parameter no longer necessary, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Have calc_layout() pass the computed object number back to its
caller. (This is a small step to simplify review.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)
Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error. At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.
The only other value calc_layout() will return is 0, so make that
explicit in the successful case.
This resolves:
http://tracker.ceph.com/issues/4240
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull Ceph fix from Sage Weil:
"This fixes a bug in the new message decoding that just went in during
the last window."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix decoding of pgids
In 4f6a7e5ee1 we effectively dropped support
for the legacy encoding for the OSDMap and incremental. However, we didn't
fix the decoding for the pgid.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Pull Ceph updates from Sage Weil:
"A few groups of patches here. Alex has been hard at work improving
the RBD code, layout groundwork for understanding the new formats and
doing layering. Most of the infrastructure is now in place for the
final bits that will come with the next window.
There are a few changes to the data layout. Jim Schutt's patch fixes
some non-ideal CRUSH behavior, and a set of patches from me updates
the client to speak a newer version of the protocol and implement an
improved hashing strategy across storage nodes (when the server side
supports it too).
A pair of patches from Sam Lang fix the atomicity of open+create
operations. Several patches from Yan, Zheng fix various mds/client
issues that turned up during multi-mds torture tests.
A final set of patches expose file layouts via virtual xattrs, and
allow the policies to be set on directories via xattrs as well
(avoiding the awkward ioctl interface and providing a consistent
interface for both kernel mount and ceph-fuse users)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
libceph: add support for HASHPSPOOL pool flag
libceph: update osd request/reply encoding
libceph: calculate placement based on the internal data types
ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
ceph: update "ceph_features.h"
libceph: decode into cpu-native ceph_pg type
libceph: rename ceph_pg -> ceph_pg_v1
rbd: pass length, not op for osd completions
rbd: move rbd_osd_trivial_callback()
libceph: use a do..while loop in con_work()
libceph: use a flag to indicate a fault has occurred
libceph: separate non-locked fault handling
libceph: encapsulate connection backoff
libceph: eliminate sparse warnings
ceph: eliminate sparse warnings in fs code
rbd: eliminate sparse warnings
libceph: define connection flag helpers
rbd: normalize dout() calls
rbd: barriers are hard
rbd: ignore zero-length requests
...
The legacy behavior adds the pgid seed and pool together as the input for
CRUSH. That is problematic because each pool's PGs end up mapping to the
same OSDs: 1.5 == 2.4 == 3.3 == ...
Instead, if the HASHPSPOOL flag is set, we has the ps and pool together and
feed that into CRUSH. This ensures that two adjacent pools will map to
an independent pseudorandom set of OSDs.
Advertise our support for this via a protocol feature flag.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Use the new version of the encoding for osd requests and replies. In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request. Update the rbd and fs/ceph users
appropriately.
The main changes are:
- we keep pointers into the request memory for fields we need to update
each time the request is sent out over the wire
- we keep information about the result in an array in the request struct
where the users can easily get at it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type. This allows us to
pass the full 32-bit precision of the pgid.seed to the callers. It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012. Require these
features to simplify support; nobody is running older userspace.
Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Always decode data into our cpu-native ceph_pg type that has the correct
field widths. Limit any remaining uses of ceph_pg_v1 to dealing with the
legacy protocol.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Rename the old version this type to distinguish it from the new version.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
This just converts a manually-implemented loop into a do..while loop
in con_work(). It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.
Also update a few dout() calls near the affected code for
consistency.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred. This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset. The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.
Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish(). Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault(). Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.
The motivation for this was a warning from sparse about the locking
being done here. Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.
This partially resolves:
http://tracker.ceph.com/issues/4184
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().
Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
- Convert functions that are never referenced externally to have
static scope.
- Pass NULL rather than 0 for a pointer argument in one spot in
ceph_monc_delete_snapid()
This partially resolves:
http://tracker.ceph.com/issues/4184
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate operations performed on
a connection's flags.
This resolves:
http://tracker.ceph.com/issues/4234
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The functions used for working with ceph page vectors are defined
with char pointers, but they're really intended to operate on
untyped data. Change the types of these function parameters
to (void *) to reflect this.
(Note that the functions now assume void pointer arithmetic works
like arithmetic on char pointers.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.
This operation sends no data to the osd; everything required is
encoded in identity of the target object.
The result will be ENOENT if the object doesn't exist. If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format). The size is a 64 bit unsigned and the time is
ceph_timespec structure (two unsigned 32-bit integers, representing
a seconds and nanoseconds value).
This resolves:
http://tracker.ceph.com/issues/4007
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Simplify the way the data length recorded in a message header is
calculated in ceph_osdc_build_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In osd_req_encode_op() there are a few cases that handle osd
opcodes that are never used in the kernel. The presence of
this code gives the impression it's correct (which really can't
be assumed), and may impose some unnecessary restrictions on
some upcoming refactoring of this code.
So delete this effectively dead code, and report uses of the
previously handled cases as unsupported.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If osd_req_encode_op() is given any opcode it doesn't recognize
it reports an error.
This patch fleshes out that routine to distinguish between
well-defined but unsupported values and values that are simply
bogus.
This and the next commit are related to:
http://tracker.ceph.com/issues/4126
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Update ceph_osd_op_name() to include the newly-added definitions in
"rados.h", and to match its counterpart in the user space code.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add the definition of ceph_osd_state_name(), to match its
counterpart in user space.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are no actual users of ceph_osdc_wait_event(). This would
have been one-shot events, but we no longer support those so just
get rid of this function.
Since this leaves nothing else that waits for the completion of an
event, we can get rid of the completion in a struct ceph_osd_event.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of ceph_osdc_create_event(), and it
provides 0 as its "one_shot" argument. Get rid of that argument and
just use 0 in its place.
Replace the code in handle_watch_notify() that executes if one_shot
is nonzero in the event with a BUG_ON() call.
While modifying "osd_client.c", give handle_watch_notify() static
scope.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is no caller of ceph_calc_raw_layout() outside of libceph, so
there's no need to export from the module.
Furthermore, there is only one caller, in calc_layout(), and it
is not much more than a simple wrapper for that function.
So get rid of ceph_calc_raw_layout() and embed it instead within
calc_layout().
While touching "osd_client.c", get rid of the unnecessary forward
declaration of __send_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only callers of ceph_osdc_init() and ceph_osdc_stop()
ceph_create_client() and ceph_destroy_client() (respectively)
and they are in the same kernel module as those two functions.
There's therefore no need to export those interfaces, so don't.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Two of the three callers of the osd client's send_queued() function
already hold the osd client mutex and drop it before the call.
Change send_queued() so it assumes the caller holds the mutex, and
update all callers accordingly. Rename it __send_queued() to match
the convention used elsewhere in the file with respect to the lock.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The "num_reply" parameter to ceph_osdc_new_request() is never
used inside that function, so get rid of it.
Note that ceph_sync_write() passes 2 for that argument, while all
other callers pass 1. It doesn't matter, but perhaps someone should
verify this doesn't indicate a problem.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "flags" argument. Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of ceph_osdc_writepages(), and it always
passes 0 as its "dosync" argument. Get rid of that argument and
replace its use in ceph_osdc_writepages() with 0.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of ceph_osdc_writepages(), and it always
passes the value true as its "nofail" argument. Get rid of that
argument and replace its use in ceph_osdc_writepages() with the
constant value true.
This and a number of cleanup patches that follow resolve:
http://tracker.ceph.com/issues/4126
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is a check in the completion path for osd requests that
ensures the number of pages allocated is enough to hold the amount
of incoming data expected.
For bio requests coming from rbd the "number of pages" is not really
meaningful (although total length would be). So stop requiring that
nr_pages be supplied for bio requests. This is done by checking
whether the pages pointer is null before checking the value of
nr_pages.
Note that this value is passed on to the messenger, but there it's
only used for debugging--it's never used for validation.
While here, change another spot that used r_pages in a debug message
inappropriately, and also invalidate the r_con_filling_msg pointer
after dropping a reference to it.
This resolves:
http://tracker.ceph.com/issues/3875
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, if the OSD client finds an osd request has had a bio list
attached to it, it drops a reference to it (or rather, to the first
entry on that list) when the request is released.
The code that added that reference (i.e., the rbd client) is
therefore required to take an extra reference to that first bio
structure.
The osd client doesn't really do anything with the bio pointer other
than transfer it from the osd request structure to outgoing (for
writes) and ingoing (for reads) messages. So it really isn't the
right place to be taking or dropping references.
Furthermore, the rbd client already holds references to all bio
structures it passes to the osd client, and holds them until the
request is completed. So there's no need for this extra reference
whatsoever.
So remove the bio_put() call in ceph_osdc_release_request(), as
well as its matching bio_get() call in rbd_osd_req_create().
This change could lead to a crash if old libceph.ko was used with
new rbd.ko. Add a compatibility check at rbd initialization time to
avoid this possibilty.
This resolves:
http://tracker.ceph.com/issues/3798 and
http://tracker.ceph.com/issues/3799
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An upcoming change implements semantic change that could lead to
a crash if an old version of the libceph kernel module is used with
a new version of the rbd kernel module.
In order to preclude that possibility, this adds a compatibilty
check interface. If this interface doesn't exist, the modules are
obviously not compatible. But if it does exist, this provides a way
of letting the caller know whether it will operate properly with
this libceph module.
Perhaps confusingly, it returns false right now. The semantic
change mentioned above will make it return true.
This resolves:
http://tracker.ceph.com/issues/3800
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The ceph messenger has a few spots that are only used when
bio messages are supported, and that's only when CONFIG_BLOCK
is defined. This surrounds a couple of spots with #ifdef's
that would cause a problem if CONFIG_BLOCK were not present
in the kernel configuration.
This resolves:
http://tracker.ceph.com/issues/3976
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Today ceph opens tcp sockets from a delayed work callback. Delayed
work happens from kernel threads which are always in the initial
network namespace. Therefore fail early if someone attempts
to mount a ceph filesystem from something other than the initial
network namespace.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The variable "str" is used as both the source and destination in
function snprintf(), which is undefined behavior based on C11. The
original description in C11 is:
"If copying takes place between objects that
overlap, the behavior is undefined."
And, the function of ceph_osdmap_state_str() is to return the osdmap
state, so it should return "doesn't exist" when all the conditions
are not satisfied. I fix it in this patch.
[elder@inktank.com: shortened the commit message]
Signed-off-by: Cong Ding <dinggnu@gmail.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
provided an array of ceph osd request operations. Rather than just
passing the number of operations in the array, the caller is
required append an additional zeroed operation structure to signal
the end of the array.
All callers know the number of operations at the time these
functions are called, so drop the silly zero entry and supply that
number directly. As a result, get_num_ops() is no longer needed.
This also means that ceph_osdc_alloc_request() never uses its ops
argument, so that can be dropped.
Also rbd_create_rw_ops() no longer needs to add one to reserve room
for the additional op.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Only one of the two callers of ceph_osdc_alloc_request() provides
page or bio data for its payload. And essentially all that function
was doing with those arguments was assigning them to fields in the
osd request structure.
Simplify ceph_osdc_alloc_request() by having the caller take care of
making those assignments
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only thing ceph_osdc_alloc_request() really does with the
flags value it is passed is assign it to the newly-created
osd request structure. Do that in the caller instead.
Both callers subsequently call ceph_osdc_build_request(), so have
that function (instead of ceph_osdc_alloc_request()) issue a warning
if a request comes through with neither the read nor write flags set.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
of it. Consequently, the corresponding parameter in calc_layout()
becomes unused, so get rid of that as well.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A snapshot id must be provided to ceph_calc_raw_layout() even though
it is not needed at all for calculating the layout.
Where the snapshot id *is* needed is when building the request
message for an osd operation.
Drop the snapid parameter from ceph_calc_raw_layout() and pass
that value instead in ceph_osdc_build_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ceph_calc_file_object_mapping() takes (among other things) a "file"
offset and length, and based on the layout, determines the object
number ("bno") backing the affected portion of the file's data and
the offset into that object where the desired range begins. It also
computes the size that should be used for the request--either the
amount requested or something less if that would exceed the end of
the object.
This patch changes the input length parameter in this function so it
is used only for input. That is, the argument will be passed by
value rather than by address, so the value provided won't get
updated by the function.
The value would only get updated if the length would surpass the
current object, and in that case the value it got updated to would
be exactly that returned in *oxlen.
Only one of the two callers is affected by this change. Update
ceph_calc_raw_layout() so it records any updated value.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The len argument to ceph_osdc_build_request() is set up to be
passed by address, but that function never updates its value
so there's no need to do this. Tighten up the interface by
passing the length directly.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Since every osd message is now prepared to include trailing data,
there's no need to check ahead of time whether any operations will
make use of the trail portion of the message.
We can drop the second argument to get_num_ops(), and as a result we
can also get rid of op_needs_trail() which is no longer used.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request structure contains an optional trail portion, which
if present will contain data to be passed in the payload portion of
the message containing the request. The trail field is a
ceph_pagelist pointer, and if null it indicates there is no trail.
A ceph_pagelist structure contains a length field, and it can
legitimately hold value 0. Make use of this to change the
interpretation of the "trail" of an osd request so that every osd
request has trailing data, it just might have length 0.
This means we change the r_trail field in a ceph_osd_request
structure from a pointer to a structure that is always initialized.
Note that in ceph_osdc_start_request(), the trail pointer (or now
address of that structure) is assigned to a ceph message's trail
field. Here's why that's still OK (looking at net/ceph/messenger.c):
- What would have resulted in a null pointer previously will now
refer to a 0-length page list. That message trail pointer
is used in two functions, write_partial_msg_pages() and
out_msg_pos_next().
- In write_partial_msg_pages(), a null page list pointer is
handled the same as a message with 0-length trail, and both
result in a "in_trail" variable set to false. The trail
pointer is only used if in_trail is true.
- The only other place the message trail pointer is used is
out_msg_pos_next(). That function is only called by
write_partial_msg_pages() and only touches the trail pointer
if the in_trail value it is passed is true.
Therefore a null ceph_msg->trail pointer is equivalent to a non-null
pointer referring to a 0-length page list structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The last two parameters to ceph_osd_build_request() describe the
object id, but the values passed always come from the osd request
structure whose address is also provided. Get rid of those last
two parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reformat __reset_osd() into three distinct blocks of code
handling the three return cases.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This saves us some cycles, but does not affect the placement result at
all.
This corresponds to ceph.git commit 4abb53d4f.
Signed-off-by: Sage Weil <sage@inktank.com>
Add libceph support for a new CRUSH tunable recently added to Ceph servers.
Consider the CRUSH rule
step chooseleaf firstn 0 type <node_type>
This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.
When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf. This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.
This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.
For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.
If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.
In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf. Wire this up as a new CRUSH tunable.
Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary. This
requires that OSD's data for that PG to need moving as well. This
seems unavoidable but should be relatively rare.
This corresponds to ceph.git commit 88f218181a9e6d2292e2697fc93797d0f6d6e5dc.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Sage Weil <sage@inktank.com>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
CC: Sage Weil <sage@inktank.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Sage Weil <sage@inktank.com>
Pull Ceph fixes from Sage Weil:
"Two of Alex's patches deal with a race when reseting server
connections for open RBD images, one demotes some non-fatal BUGs to
WARNs, and my patch fixes a protocol feature bit failure path."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix protocol feature mismatch failure path
libceph: WARN, don't BUG on unexpected connection states
libceph: always reset osds when kicking
libceph: move linger requests sooner in kick_requests()
We should not set con->state to CLOSED here; that happens in
ceph_fault() in the caller, where it first asserts that the state
is not yet CLOSED. Avoids a BUG when the features don't match.
Since the fail_protocol() has become a trivial wrapper, replace
calls to it with direct calls to reset_connection().
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
A number of assertions in the ceph messenger are implemented with
BUG_ON(), killing the system if connection's state doesn't match
what's expected. At this point our state model is (evidently) not
well understood enough for these assertions to trigger a BUG().
Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
so we learn about these issues without killing the machine.
We now recognize that a connection fault can occur due to a socket
closure at any time, regardless of the state of the connection. So
there is really nothing we can assert about the state of the
connection at that point so eliminate that assertion.
Reported-by: Ugis <ugis22@gmail.com>
Tested-by: Ugis <ugis22@gmail.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When ceph_osdc_handle_map() is called to process a new osd map,
kick_requests() is called to ensure all affected requests are
updated if necessary to reflect changes in the osd map. This
happens in two cases: whenever an incremental map update is
processed; and when a full map update (or the last one if there is
more than one) gets processed.
In the former case, the kick_requests() call is followed immediately
by a call to reset_changed_osds() to ensure any connections to osds
affected by the map change are reset. But for full map updates
this isn't done.
Both cases should be doing this osd reset.
Rather than duplicating the reset_changed_osds() call, move it into
the end of kick_requests().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The kick_requests() function is called by ceph_osdc_handle_map()
when an osd map change has been indicated. Its purpose is to
re-queue any request whose target osd is different from what it
was when it was originally sent.
It is structured as two loops, one for incomplete but registered
requests, and a second for handling completed linger requests.
As a special case, in the first loop if a request marked to linger
has not yet completed, it is moved from the request list to the
linger list. This is as a quick and dirty way to have the second
loop handle sending the request along with all the other linger
requests.
Because of the way it's done now, however, this quick and dirty
solution can result in these incomplete linger requests never
getting re-sent as desired. The problem lies in the fact that
the second loop only arranges for a linger request to be sent
if it appears its target osd has changed. This is the proper
handling for *completed* linger requests (it avoids issuing
the same linger request twice to the same osd).
But although the linger requests added to the list in the first loop
may have been sent, they have not yet completed, so they need to be
re-sent regardless of whether their target osd has changed.
The first required fix is we need to avoid calling __map_request()
on any incomplete linger request. Otherwise the subsequent
__map_request() call in the second loop will find the target osd
has not changed and will therefore not re-send the request.
Second, we need to be sure that a sent but incomplete linger request
gets re-sent. If the target osd is the same with the new osd map as
it was when the request was originally sent, this won't happen.
This can be fixed through careful handling when we move these
requests from the request list to the linger list, by unregistering
the request *before* it is registered as a linger request. This
works because a side-effect of unregistering the request is to make
the request's r_osd pointer be NULL, and *that* will ensure the
second loop actually re-sends the linger request.
Processing of such a request is done at that point, so continue with
the next one once it's been moved.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pull Ceph update from Sage Weil:
"There are a few different groups of commits here. The largest is
Alex's ongoing work to enable the coming RBD features (cloning,
striping). There is some cleanup in libceph that goes along with it.
Cyril and David have fixed some problems with NFS reexport (leaking
dentries and page locks), and there is a batch of patches from Yan
fixing problems with the fs client when running against a clustered
MDS. There are a few bug fixes mixed in for good measure, many of
which will be going to the stable trees once they're upstream.
My apologies for the late pull. There is still a gremlin in the rbd
map/unmap code and I was hoping to include the fix for that as well,
but we haven't been able to confirm the fix is correct yet; I'll send
that in a separate pull once it's nailed down."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
rbd: get rid of rbd_{get,put}_dev()
libceph: register request before unregister linger
libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
libceph: init event->node in ceph_osdc_create_event()
libceph: init osd->o_node in create_osd()
libceph: report connection fault with warning
libceph: socket can close in any connection state
rbd: don't use ENOTSUPP
rbd: remove linger unconditionally
rbd: get rid of RBD_MAX_SEG_NAME_LEN
libceph: avoid using freed osd in __kick_osd_requests()
ceph: don't reference req after put
rbd: do not allow remove of mounted-on image
libceph: Unlock unprocessed pages in start_read() error path
ceph: call handle_cap_grant() for cap import message
ceph: Fix __ceph_do_pending_vmtruncate
ceph: Don't add dirty inode to dirty list if caps is in migration
ceph: Fix infinite loop in __wake_requests
ceph: Don't update i_max_size when handling non-auth cap
bdi_register: add __printf verification, fix arg mismatch
...
In kick_requests(), we need to register the request before we
unregister the linger request. Otherwise the unregister will
reset the request's osd pointer to NULL.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The red-black node in the ceph osd request structure is initialized
in ceph_osdc_alloc_request() using rbd_init_node(). We do need to
initialize this, because in __unregister_request() we call
RB_EMPTY_NODE(), which expects the node it's checking to have
been initialized. But rb_init_node() is apparently overkill, and
may in fact be on its way out. So use RB_CLEAR_NODE() instead.
For a little more background, see this commit:
4c199a93 rbtree: empty nodes have no color"
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The red-black node node in the ceph osd event structure is not
initialized in create_osdc_create_event(). Because this node can
be the subject of a RB_EMPTY_NODE() call later on, we should ensure
the node is initialized properly for that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The red-black node node in the ceph osd structure is not initialized
in create_osd(). Because this node can be the subject of a
RB_EMPTY_NODE() call later on, we should ensure the node is
initialized properly for that. Add a call to RB_CLEAR_NODE()
initialize it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When a connection's socket disconnects, or if there's a protocol
error of some kind on the connection, a fault is signaled and
the connection is reset (closed and reopened, basically). We
currently get an error message on the log whenever this occurs.
A ceph connection will attempt to reestablish a socket connection
repeatedly if a fault occurs. This means that these error messages
will get repeatedly added to the log, which is undesirable.
Change the error message to be a warning, so they don't get
logged by default.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A connection's socket can close for any reason, independent of the
state of the connection (and without irrespective of the connection
mutex). As a result, the connectino can be in pretty much any state
at the time its socket is closed.
Handle those other cases at the top of con_work(). Pull this whole
block of code into a separate function to reduce the clutter.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In __unregister_linger_request(), the request is being removed
from the osd client's req_linger list only when the request
has a non-null osd pointer. It should be done whether or not
the request currently has an osd.
This is most likely a non-issue because I believe the request
will always have an osd when this function is called.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
If an osd has no requests and no linger requests, __reset_osd()
will just remove it with a call to __remove_osd(). That drops
a reference to the osd, and therefore the osd may have been free
by the time __reset_osd() returns. That function offers no
indication this may have occurred, and as a result the osd will
continue to be used even when it's no longer valid.
Change__reset_osd() so it returns an error (ENODEV) when it
deletes the osd being reset. And change __kick_osd_requests() so it
returns immediately (before referencing osd again) if __reset_osd()
returns *any* error.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In __unregister_request(), there is a call to list_del_init()
referencing a request that was the subject of a call to
ceph_osdc_put_request() on the previous line. This is not
safe, because the request structure could have been freed
by the time we reach the list_del_init().
Fix this by reversing the order of these lines.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
This would reset a connection with any OSD that had an outstanding
request that was taking more than N seconds. The idea was that if the
OSD was buggy, the client could compensate by resending the request.
In reality, this only served to hide server bugs, and we haven't
actually seen such a bug in quite a while. Moreover, the userspace
client code never did this.
More importantly, often the request is taking a long time because the
OSD is trying to recover, or overloaded, and killing the connection
and retrying would only make the situation worse by giving the OSD
more work to do.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Define and export function ceph_pg_pool_name_by_id() to supply
the name of a pg pool whose id is given. This will be used by
the next patch.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Ensure that we set the err value correctly so that we do not pass a 0
value to ERR_PTR and confuse the calling code. (In particular,
osd_client.c handle_map() will BUG(!newmap)).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Pull Ceph fixes form Sage Weil:
"There are two fixes in the messenger code, one that can trigger a NULL
dereference, and one that error in refcounting (extra put). There is
also a trivial fix that in the fs client code that is triggered by NFS
reexport."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix dentry reference leak in encode_fh()
libceph: avoid NULL kref_put when osd reset races with alloc_msg
rbd: reset BACKOFF if unable to re-queue
The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
may return NULL. It also drops con->mutex while it allocates a message,
which means that the connection state may change (e.g., get closed). If
that happens, we clean up and bail out. Avoid calling ceph_msg_put() on
a NULL return value and triggering a crash.
This was observed when an ->alloc_msg() call races with a timeout that
resends a zillion messages and resets the connection, and ->alloc_msg()
returns NULL (because the request was resent to another target).
Fixes http://tracker.newdream.net/issues/3342
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
The ceph_on_in_msg_alloc() method drops con->mutex while it allocates a
message. If that races with a timeout that resends a zillion messages and
resets the connection, and the ->alloc_msg() method returns a NULL message,
it will call ceph_msg_put(NULL) and BUG.
Fix by only calling put if msg is non-NULL.
Fixes http://tracker.newdream.net/issues/3142
Signed-off-by: Sage Weil <sage@inktank.com>
Pull module signing support from Rusty Russell:
"module signing is the highlight, but it's an all-over David Howells frenzy..."
Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.
* 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
X.509: Fix indefinite length element skip error handling
X.509: Convert some printk calls to pr_devel
asymmetric keys: fix printk format warning
MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
MODSIGN: Make mrproper should remove generated files.
MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
MODSIGN: Use the same digest for the autogen key sig as for the module sig
MODSIGN: Sign modules during the build process
MODSIGN: Provide a script for generating a key ID from an X.509 cert
MODSIGN: Implement module signature checking
MODSIGN: Provide module signing public keys to the kernel
MODSIGN: Automatically generate module signing keys if missing
MODSIGN: Provide Kconfig options
MODSIGN: Provide gitignore and make clean rules for extra files
MODSIGN: Add FIPS policy
module: signature checking hook
X.509: Add a crypto key parser for binary (DER) X.509 certificates
MPILIB: Provide a function to read raw data into an MPI
X.509: Add an ASN.1 decoder
X.509: Add simple ASN.1 grammar compiler
...
This patch defines a single function, queue_con_delay() to call
queue_delayed_work() for a connection. It basically generalizes
what was previously queue_con() by adding the delay argument.
queue_con() is now a simple helper that passes 0 for its delay.
queue_con_delay() returns 0 if it queued work or an errno if it
did not for some reason.
If con_work() finds the BACKOFF flag set for a connection, it now
calls queue_con_delay() to handle arranging to start again after a
delay.
Note about connection reference counts: con_work() only ever gets
called as a work item function. At the time that work is scheduled,
a reference to the connection is acquired, and the corresponding
con_work() call is then responsible for dropping that reference
before it returns.
Previously, the backoff handling inside con_work() silently handed
off its reference to delayed work it scheduled. Now that
queue_con_delay() is used, a new reference is acquired for the
newly-scheduled work, and the original reference is dropped by the
con->ops->put() call at the end of the function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Both ceph_fault() and con_work() include handling for imposing a
delay before doing further processing on a faulted connection.
The latter is used only if ceph_fault() is unable to.
Instead, just let con_work() always be responsible for implementing
the delay. After setting up the delay value, set the BACKOFF flag
on the connection unconditionally and call queue_con() to ensure
con_work() will get called to handle it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
If ceph_fault() is unable to queue work after a delay, it sets the
BACKOFF connection flag so con_work() will attempt to do so.
In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
result in newly-queued work, it simply ignores this condition and
proceeds as if no backoff delay were desired. There are two
problems with this--one of which is a bug.
The first problem is simply that the intended behavior is to back
off, and if we aren't able queue the work item to run after a delay
we're not doing that.
The only reason queue_delayed_work() won't queue work is if the
provided work item is already queued. In the messenger, this
means that con_work() is already scheduled to be run again. So
if we simply set the BACKOFF flag again when this occurs, we know
the next con_work() call will again attempt to hold off activity
on the connection until after the delay.
The second problem--the bug--is a leak of a reference count. If
queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
the connection reference held on entry to con_work(). However,
processing is (was) allowed to continue, and at the end of the
function a second con->ops->put() is called.
This patch fixes both problems.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Empty nodes have no color. We can make use of this property to simplify
the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also,
we can get rid of the rb_init_node function which had been introduced by
commit 88d19cf379 ("timers: Add rb_init_node() to allow for stack
allocated rb nodes") to avoid some issue with the empty node's color not
being initialized.
I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
doing there, though. axboe introduced them in commit 10fd48f237
("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I
see it, the 'empty node' abstraction is only used by rbtree users to
flag nodes that they haven't inserted in any rbtree, so asking the
predecessor or successor of such nodes doesn't make any sense.
One final rb_init_node() caller was recently added in sysctl code to
implement faster sysctl name lookups. This code doesn't make use of
RB_EMPTY_NODE at all, and from what I could see it only called
rb_init_node() under the mistaken assumption that such initialization was
required before node insertion.
[sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Give the key type the opportunity to preparse the payload prior to the
instantiation and update routines being called. This is done with the
provision of two new key type operations:
int (*preparse)(struct key_preparsed_payload *prep);
void (*free_preparse)(struct key_preparsed_payload *prep);
If the first operation is present, then it is called before key creation (in
the add/update case) or before the key semaphore is taken (in the update and
instantiate cases). The second operation is called to clean up if the first
was called.
preparse() is given the opportunity to fill in the following structure:
struct key_preparsed_payload {
char *description;
void *type_data[2];
void *payload;
const void *data;
size_t datalen;
size_t quotalen;
};
Before the preparser is called, the first three fields will have been cleared,
the payload pointer and size will be stored in data and datalen and the default
quota size from the key_type struct will be stored into quotalen.
The preparser may parse the payload in any way it likes and may store data in
the type_data[] and payload fields for use by the instantiate() and update()
ops.
The preparser may also propose a description for the key by attaching it as a
string to the description field. This can be used by passing a NULL or ""
description to the add_key() system call or the key_create_or_update()
function. This cannot work with request_key() as that required the description
to tell the upcall about the key to be created.
This, for example permits keys that store PGP public keys to generate their own
name from the user ID and public key fingerprint in the key.
The instantiate() and update() operations are then modified to look like this:
int (*instantiate)(struct key *key, struct key_preparsed_payload *prep);
int (*update)(struct key *key, struct key_preparsed_payload *prep);
and the new payload data is passed in *prep, whether or not it was preparsed.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we are creating an osd request and get an invalid layout, return
an EINVAL to the caller. We switch up the return to have an error
code instead of NULL implying -ENOMEM.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If we encounter an invalid (e.g., zeroed) mapping, return an error
and avoid a divide by zero.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Using list_move_tail() instead of list_del() + list_add_tail().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
Make ceph_monc_do_poolop() static to remove the following sparse warning:
* net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
declared. Should it be static?
Also drops the 'ceph_monc_' prefix, now being a private function.
Signed-off-by: Iulius Curt <icurt@ixiacom.com>
Signed-off-by: Sage Weil <sage@inktank.com>
In write_partial_msg_pages(), pages need to be kmapped in order to
perform a CRC-32c calculation on them. As an artifact of the way
this code used to be structured, the kunmap() call was separated
from the kmap() call and both were done conditionally. But the
conditions under which the kmap() and kunmap() calls were made
differed, so there was a chance a kunmap() call would be done on a
page that had not been mapped.
The symptom of this was tripping a BUG() in kunmap_high() when
pkmap_count[nr] became 0.
Reported-by: Bryan K. Wright <bryan@virginia.edu>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.
When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work(). During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().
If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner. In this case Ceph's protocol negotiation
can complete succesfully.
The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client. As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read(). If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner. Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.
The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".
Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.
[elder@inktak.com: added note about server-side behavior]
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
The debugfs directory includes the cluster fsid and our unique global_id.
We need to delay the initialization of the debug entry until we have
learned both the fsid and our global_id from the monitor or else the
second client can't create its debugfs entry and will fail (and multiple
client instances aren't properly reflected in debugfs).
Reported by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Avoid crashing if the crypto key payload was NULL, as when it was not correctly
allocated and initialized. Also, avoid leaking it.
Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Pull Ceph changes from Sage Weil:
"Lots of stuff this time around:
- lots of cleanup and refactoring in the libceph messenger code, and
many hard to hit races and bugs closed as a result.
- lots of cleanup and refactoring in the rbd code from Alex Elder,
mostly in preparation for the layering functionality that will be
coming in 3.7.
- some misc rbd cleanups from Josh Durgin that are finally going
upstream
- support for CRUSH tunables (used by newer clusters to improve the
data placement)
- some cleanup in our use of d_parent that Al brought up a while back
- a random collection of fixes across the tree
There is another patch coming that fixes up our ->atomic_open()
behavior, but I'm going to hammer on it a bit more before sending it."
Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
rbd: create rbd_refresh_helper()
rbd: return obj version in __rbd_refresh_header()
rbd: fixes in rbd_header_from_disk()
rbd: always pass ops array to rbd_req_sync_op()
rbd: pass null version pointer in add_snap()
rbd: make rbd_create_rw_ops() return a pointer
rbd: have __rbd_add_snap_dev() return a pointer
libceph: recheck con state after allocating incoming message
libceph: change ceph_con_in_msg_alloc convention to be less weird
libceph: avoid dropping con mutex before fault
libceph: verify state after retaking con lock after dispatch
libceph: revoke mon_client messages on session restart
libceph: fix handling of immediate socket connect failure
ceph: update MAINTAINERS file
libceph: be less chatty about stray replies
libceph: clear all flags on con_close
libceph: clean up con flags
libceph: replace connection state bits with states
libceph: drop unnecessary CLOSED check in socket state change callback
libceph: close socket directly from ceph_con_close()
...
We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem. If the state does change, -EAGAIN
will get us back to con_work() and loop.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
This function's calling convention is very limiting. In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).
Instead, return an normal 0 or error code, and make the skip a pointer
output parameter. Drop the useless in_hdr argument (we have the con
pointer).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it. This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).
Add annotation so that lockdep realizes we will drop the mutex before
returning.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
We drop the con mutex when delivering a message. When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Revoke all mon_client messages when we shut down the old connection.
This is mostly moot since we are re-using the same ceph_connection,
but it is cleaner.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
There are many (normal) conditions that can lead to us getting
unexpected replies, include cluster topology changes, osd failures,
and timeouts. There's no need to spam the console about it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.
Signed-off-by: Sage Weil <sage@inktank.com>
Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits. All of the con->state checks are
now under the protection of the con mutex, so this is safe. It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.
This appears to hold up well to stress testing both with and without socket
failure injection on the server side.
Signed-off-by: Sage Weil <sage@inktank.com>
It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.
Signed-off-by: Sage Weil <sage@inktank.com>
Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.
Signed-off-by: Sage Weil <sage@inktank.com>
If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.
We should also take the con mutex before printing out the state bits in
the debug output.
Signed-off-by: Sage Weil <sage@inktank.com>
This is a trivial fix for the debug output, as it is inconsistent
with the function name so may confuse people when debugging.
[elder@inktank.com: switched to use __func__]
Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Reviewed-by: Alex Elder <elder@inktank.com>
We exponentially back off when we encounter connection errors. If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.
Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node. Fixes ceph issue #2802.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Take the con mutex while we are initiating a ceph open. This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path. The problem
is that a sequence like:
- start reading message
- initialize bio_iter
- read half a message
- messenger fault, reconnect
- restart reading message
- ** bio_iter now non-NULL, not reinitialized **
- read past end of bio, crash
Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
The linger op registration (i.e., watch) modifies the object state. As
such, the OSD will reply with success if it has already applied without
doing the associated side-effects (setting up the watch session state).
If we lose the ACK and resubmit, we will see success but the watch will not
be correctly registered and we won't get notifies.
To fix this, always resubmit the linger op with a new tid. We accomplish
this by re-registering as a linger (i.e., 'registered') if we are not yet
registered. Then the second loop will treat this just like a normal
case of re-registering.
This mirrors a similar fix on the userland ceph.git, commit 5dd68b95, and
ceph bug #2796.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Hold the mutex while twiddling all of the state bits to avoid possible
races. While we're here, make not of why we cannot close the socket
directly.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}(). This is more informative than
libceph: osd0 192.168.106.220:6801 (null)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
The server side recently added support for tuning some magic
crush variables. Decode these variables if they are present, or use the
default values if they are not present.
Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5.
Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
This is simply cleanup that will keep things more closely synced with the
userland code.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.
Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.
Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.
Signed-off-by: Sage Weil <sage@inktank.com>
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Initialize the type field for messages in a msgpool. The caller was doing
this for osd ops, but not for the reply messages.
Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Pull networking changes from David S Miller:
1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
trie and use prebuilt routes cached there.
No more garbage collection, no more rDOS attacks on the routing
cache. Instead we now get predictable and consistent performance,
no matter what the pattern of traffic we service.
This has been almost 2 years in the making. Special thanks to
Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
have helped along the way.
I'm sure that with a change of this magnitude there will be some
kind of fallout, but such things ought the be simple to fix at this
point. Luckily I'm not European so I'll be around all of August to
fix things :-)
The major stages of this work here are each fronted by a forced
merge commit whose commit message contains a top-level description
of the motivations and implementation issues.
2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
input.
3) TCP SYN/ACK performance tweaks from Eric Dumazet.
4) Add namespace support for netfilter L4 conntrack helpers, from Gao
Feng.
5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
Yuval Mintz.
6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.
7) Support for connection tracker helpers in userspace, from Pablo
Neira Ayuso.
8) Allow userspace driven TX load balancing functions in TEAM driver,
from Jiri Pirko.
9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
embedded gotos.
10) TCP Small Queues, essentially minimize the amount of TCP data queued
up in the packet scheduler layer. Whereas the existing BQL (Byte
Queue Limits) limits the pkt_sched --> netdevice queuing levels,
this controls the TCP --> pkt_sched queueing levels.
From Eric Dumazet.
11) Reduce the number of get_page/put_page ops done on SKB fragments,
from Alexander Duyck.
12) Implement protection against blind resets in TCP (RFC 5961), from
Eric Dumazet.
13) Support the client side of TCP Fast Open, basically the ability to
send data in the SYN exchange, from Yuchung Cheng.
Basically, the sender queues up data with a sendmsg() call using
MSG_FASTOPEN, then they do the connect() which emits the queued up
fastopen data.
14) Avoid all the problems we get into in TCP when timers or PMTU events
hit a locked socket. The TCP Small Queues changes added a
tcp_release_cb() that allows us to queue work up to the
release_sock() caller, and that's what we use here too. From Eric
Dumazet.
15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
r8169: revert "add byte queue limit support".
ipv4: Change rt->rt_iif encoding.
net: Make skb->skb_iif always track skb->dev
ipv4: Prepare for change of rt->rt_iif encoding.
ipv4: Remove all RTCF_DIRECTSRC handliing.
ipv4: Really ignore ICMP address requests/replies.
decnet: Don't set RTCF_DIRECTSRC.
net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
ipv4: Remove redundant assignment
rds: set correct msg_namelen
openvswitch: potential NULL deref in sample()
tcp: dont drop MTU reduction indications
bnx2x: Add new 57840 device IDs
tcp: avoid oops in tcp_metrics and reset tcpm_stamp
niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
niu: Fix to check for dma mapping errors.
net: Fix references to out-of-scope variables in put_cmsg_compat()
net: ethernet: davinci_emac: add pm_runtime support
net: ethernet: davinci_emac: Remove unnecessary #include
...
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.
Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.
Signed-off-by: Sage Weil <sage@inktank.com>
It is possible to close a socket that is in the OPENING state. For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established. con_work() will come around and shut
down the socket.
Signed-off-by: Sage Weil <sage@inktank.com>
Do not re-initialize the con on every connection attempt. When we
ceph_con_close, there may still be work queued on the socket (e.g., to
close it), and re-initializing will clobber the work_struct state.
Signed-off-by: Sage Weil <sage@inktank.com>
Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This patch gathers a few small changes in "net/ceph/messenger.c":
out_msg_pos_next()
- small logic change that mostly affects indentation
write_partial_msg_pages().
- use a local variable trail_off to represent the offset into
a message of the trail portion of the data (if present)
- once we are in the trail portion we will always be there, so we
don't always need to check against our data position
- avoid computing len twice after we've reached the trail
- get rid of the variable tmpcrc, which is not needed
- trail_off and trail_len never change so mark them const
- update some comments
read_partial_message_bio()
- bio_iovec_idx() will never return an error, so don't bother
checking for it
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer. Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun. The
CONNECTING bit/state continues to be set during this phase.
Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states. In other
words, when NEGOTIATING is set, clear CONNECTING. That way only
one of them will be active at a time.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
There are two phases in the process of linking together the two ends
of a ceph connection. The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.
When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.
So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).
Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed. Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.
Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
There is no state explicitly defined when a ceph connection is fully
operational. So define one.
It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.
Be a little more careful when examining the old state when a socket
disconnect event is reported.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server). But that bit
is not cleared again--at least not until another connection attempt
is initiated.
Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A connection that is closed will no longer be connecting. So
clear the CONNECTING state bit in ceph_con_close(). Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.
Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down. We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.
A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down. Add a comment explaining
the reason for this.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.
con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set. This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.
What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.
The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.
So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.
Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).
Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.
libceph: start separating connection flags from state
commit 928443cd
That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.
Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been. This fixes that
error.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
commit 43643528cc
Author: Yan, Zheng <zheng.z.yan@intel.com>
rbd: Clear ceph_msg->bio_iter for retransmitted message
We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()). This means we no
longer need to use the msg->bio_iter field as a flag.
So just don't do that any more. Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already. This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed. Hold off setting the complete flag
until we've decided it's ready to send.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>