When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.
Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)
Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.
There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.
So replace these two list_move_tail() calls with list_rotate_left().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().
Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.
Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.
If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.
In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.
So drop those two parameters and use their derived equivalents.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change the type of the "more" parameter from int to bool.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Some values printed are not (necessarily) in CPU order. We already
have a copy of the converted versions, so use them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().
The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.
The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.
Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.
This resolves:
http://tracker.ceph.com/issues/4324
Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.
This and the next patch resolve:
http://tracker.ceph.com/issues/4322
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().
Drop the mutex before making this call, and make sure we're still
connected when we get it back again.
This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The pagelist_count field is never actually used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The new cases added to osd_req_encode_op() caused a new sparse
error, which highlighted an existing problem that had been
overlooked since it was originally checked in. When an unsupported
opcode is found the destination rather than the source opcode was
being used in the error message. The two differ in their byte
order, and we want to be using the one in the source.
Fix the problem in both spots.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped. Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.
Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.
This resolves:
http://tracker.ceph.com/issues/3967
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The page alignment field for a request is currently set in
ceph_osdc_build_request(). It's not needed at that point
nor do either of its callers need that value assigned at
any point before they call ceph_osdc_start_request().
So move that assignment into ceph_osdc_start_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only remaining reason to pass the osd request to calc_layout()
is to fill in its r_num_pages and r_page_alignment fields. Once it
fills those in, it doesn't do anything more with them.
We can therefore move those assignments into the caller, and get rid
of the "req" parameter entirely.
Note, however, that the only caller is ceph_osdc_new_request(),
and that immediately overwrites those fields with values based on
its passed-in page offset. So the assignment inside calc_layout()
was redundant anyway.
This resolves:
http://tracker.ceph.com/issues/4262
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the formatting of the object name (oid) to use for an object
request into the caller of calc_layout(). This makes the "vino"
parameter no longer necessary, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Have calc_layout() pass the computed object number back to its
caller. (This is a small step to simplify review.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)
Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error. At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.
The only other value calc_layout() will return is 0, so make that
explicit in the successful case.
This resolves:
http://tracker.ceph.com/issues/4240
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull Ceph fix from Sage Weil:
"This fixes a bug in the new message decoding that just went in during
the last window."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix decoding of pgids
In 4f6a7e5ee1 we effectively dropped support
for the legacy encoding for the OSDMap and incremental. However, we didn't
fix the decoding for the pgid.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Pull Ceph updates from Sage Weil:
"A few groups of patches here. Alex has been hard at work improving
the RBD code, layout groundwork for understanding the new formats and
doing layering. Most of the infrastructure is now in place for the
final bits that will come with the next window.
There are a few changes to the data layout. Jim Schutt's patch fixes
some non-ideal CRUSH behavior, and a set of patches from me updates
the client to speak a newer version of the protocol and implement an
improved hashing strategy across storage nodes (when the server side
supports it too).
A pair of patches from Sam Lang fix the atomicity of open+create
operations. Several patches from Yan, Zheng fix various mds/client
issues that turned up during multi-mds torture tests.
A final set of patches expose file layouts via virtual xattrs, and
allow the policies to be set on directories via xattrs as well
(avoiding the awkward ioctl interface and providing a consistent
interface for both kernel mount and ceph-fuse users)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
libceph: add support for HASHPSPOOL pool flag
libceph: update osd request/reply encoding
libceph: calculate placement based on the internal data types
ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
ceph: update "ceph_features.h"
libceph: decode into cpu-native ceph_pg type
libceph: rename ceph_pg -> ceph_pg_v1
rbd: pass length, not op for osd completions
rbd: move rbd_osd_trivial_callback()
libceph: use a do..while loop in con_work()
libceph: use a flag to indicate a fault has occurred
libceph: separate non-locked fault handling
libceph: encapsulate connection backoff
libceph: eliminate sparse warnings
ceph: eliminate sparse warnings in fs code
rbd: eliminate sparse warnings
libceph: define connection flag helpers
rbd: normalize dout() calls
rbd: barriers are hard
rbd: ignore zero-length requests
...
The legacy behavior adds the pgid seed and pool together as the input for
CRUSH. That is problematic because each pool's PGs end up mapping to the
same OSDs: 1.5 == 2.4 == 3.3 == ...
Instead, if the HASHPSPOOL flag is set, we has the ps and pool together and
feed that into CRUSH. This ensures that two adjacent pools will map to
an independent pseudorandom set of OSDs.
Advertise our support for this via a protocol feature flag.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Use the new version of the encoding for osd requests and replies. In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request. Update the rbd and fs/ceph users
appropriately.
The main changes are:
- we keep pointers into the request memory for fields we need to update
each time the request is sent out over the wire
- we keep information about the result in an array in the request struct
where the users can easily get at it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type. This allows us to
pass the full 32-bit precision of the pgid.seed to the callers. It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012. Require these
features to simplify support; nobody is running older userspace.
Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Always decode data into our cpu-native ceph_pg type that has the correct
field widths. Limit any remaining uses of ceph_pg_v1 to dealing with the
legacy protocol.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Rename the old version this type to distinguish it from the new version.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
This just converts a manually-implemented loop into a do..while loop
in con_work(). It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.
Also update a few dout() calls near the affected code for
consistency.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred. This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset. The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.
Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish(). Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault(). Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.
The motivation for this was a warning from sparse about the locking
being done here. Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.
This partially resolves:
http://tracker.ceph.com/issues/4184
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().
Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
- Convert functions that are never referenced externally to have
static scope.
- Pass NULL rather than 0 for a pointer argument in one spot in
ceph_monc_delete_snapid()
This partially resolves:
http://tracker.ceph.com/issues/4184
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate operations performed on
a connection's flags.
This resolves:
http://tracker.ceph.com/issues/4234
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The functions used for working with ceph page vectors are defined
with char pointers, but they're really intended to operate on
untyped data. Change the types of these function parameters
to (void *) to reflect this.
(Note that the functions now assume void pointer arithmetic works
like arithmetic on char pointers.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.
This operation sends no data to the osd; everything required is
encoded in identity of the target object.
The result will be ENOENT if the object doesn't exist. If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format). The size is a 64 bit unsigned and the time is
ceph_timespec structure (two unsigned 32-bit integers, representing
a seconds and nanoseconds value).
This resolves:
http://tracker.ceph.com/issues/4007
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Simplify the way the data length recorded in a message header is
calculated in ceph_osdc_build_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In osd_req_encode_op() there are a few cases that handle osd
opcodes that are never used in the kernel. The presence of
this code gives the impression it's correct (which really can't
be assumed), and may impose some unnecessary restrictions on
some upcoming refactoring of this code.
So delete this effectively dead code, and report uses of the
previously handled cases as unsupported.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If osd_req_encode_op() is given any opcode it doesn't recognize
it reports an error.
This patch fleshes out that routine to distinguish between
well-defined but unsupported values and values that are simply
bogus.
This and the next commit are related to:
http://tracker.ceph.com/issues/4126
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>