Commit Graph

502 Commits

Author SHA1 Message Date
Yan, Zheng 76201b6354 ceph: allow connecting to mds whose rank >= mdsmap::m_max_mds
mdsmap::m_max_mds is the expected count of active mds. It's not the
max rank of active mds. User can decrease mdsmap::m_max_mds, but does
not stop mds whose rank >= mdsmap::m_max_mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Elena Reshetova 0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 02113a0f14 libceph: convert ceph_osd.o_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 06dfa96399 libceph: convert ceph_snap_context.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov 74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov 7cc5e38f2f libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request.  0 (default)
means no limit.

osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds.  With ack vs commit behaviour
gone, abort_request() is really simple.

This is based on a patch from Artur Molchanov <artur.molchanov@synesis.ru>.

Tested-by: Artur Molchanov <artur.molchanov@synesis.ru>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov b18b9550e4 libceph: get rid of ack vs commit
- CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it
- remove support for handling ack replies (OSDs will send ack replies
  only if clients request them)
- drop the "do lingering callbacks under osd->lock" logic from
  handle_reply() -- lreq->lock is sufficient in all three cases

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24 19:04:57 +01:00
Ilya Dryomov 6c696d8560 rbd: kill obj_request->object_name and rbd_segment_name_cache
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20 12:16:15 +01:00
Ilya Dryomov 083a51fbc5 libceph: bump CEPH_PG_MAX_SIZE to 32
... to accommodate potentially very wide EC pools.  This increases the
size of a typical rbd ceph_osd_request by ~12% (from 1040 to 1168 bytes),
but I'd rather go future proof here.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov 743efcffff crush: merge working data and scratch
Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov 66a0e2d579 crush: remove mutable part of CRUSH map
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Arnd Bergmann d24cdcd3e4 libceph: use BUG() instead of BUG_ON(1)
I ran into this compile warning, which is the result of BUG_ON(1)
not always leading to the compiler treating the code path as
unreachable:

    include/linux/ceph/osdmap.h: In function 'ceph_can_shift_osds':
    include/linux/ceph/osdmap.h:62:1: error: control reaches end of non-void function [-Werror=return-type]

Using BUG() here avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:06 +01:00
Linus Torvalds 59331c215d A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
   (myself).  Also fixed a deadlock caused by a bogus allocation on the
   writeback path and authorize reply verification.
 
 - a fix for long stalls during fsync (Jeff Layton).  The client now
   has a way to force the MDS log flush, leading to ~100x speedups in
   some synthetic tests.
 
 - a new [no]require_active_mds mount option (Zheng Yan).  On mount, we
   will now check whether any of the MDSes are available and bail rather
   than block if none are.  This check can be avoided by specifying the
   "no" option.
 
 - a couple of MDS cap handling fixes and a few assorted patches
   throughout.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
 4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
 7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
 8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
 lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
 KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
 =ypWx
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A varied set of changes:

   - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
     (myself). Also fixed a deadlock caused by a bogus allocation on the
     writeback path and authorize reply verification.

   - a fix for long stalls during fsync (Jeff Layton). The client now
     has a way to force the MDS log flush, leading to ~100x speedups in
     some synthetic tests.

   - a new [no]require_active_mds mount option (Zheng Yan).

     On mount, we will now check whether any of the MDSes are available
     and bail rather than block if none are. This check can be avoided
     by specifying the "no" option.

   - a couple of MDS cap handling fixes and a few assorted patches
     throughout"

* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
  libceph: remove now unused finish_request() wrapper
  libceph: always signal completion when done
  ceph: avoid creating orphan object when checking pool permission
  ceph: properly set issue_seq for cap release
  ceph: add flags parameter to send_cap_msg
  ceph: update cap message struct version to 10
  ceph: define new argument structure for send_cap_msg
  ceph: move xattr initialzation before the encoding past the ceph_mds_caps
  ceph: fix minor typo in unsafe_request_wait
  ceph: record truncate size/seq for snap data writeback
  ceph: check availability of mds cluster on mount
  ceph: fix splice read for no Fc capability case
  ceph: try getting buffer capability for readahead/fadvise
  ceph: fix scheduler warning due to nested blocking
  ceph: fix printing wrong return variable in ceph_direct_read_write()
  crush: include mapper.h in mapper.c
  rbd: silence bogus -Wmaybe-uninitialized warning
  libceph: no need to drop con->mutex for ->get_authorizer()
  libceph: drop len argument of *verify_authorizer_reply()
  libceph: verify authorize reply on connect
  ...
2016-12-16 11:23:34 -08:00
Ilya Dryomov c297eb4269 libceph: always signal completion when done
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested.  It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.

However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set.  An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.

We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case.  Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error.  This is a bit cleaner and helps with the cancellation code.

Reported-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-14 22:39:08 +01:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Jeff Layton 1e4ef0c633 ceph: add flags parameter to send_cap_msg
Add a flags parameter to send_cap_msg, so we can request expedited
service from the MDS when we know we'll be waiting on the result.

Set that flag in the case of try_flush_caps. The callers of that
function generally wait synchronously on the result, so it's beneficial
to ask the server to expedite it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Yan, Zheng e9e427f0a1 ceph: check availability of mds cluster on mount
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Ilya Dryomov 0dde584882 libceph: drop len argument of *verify_authorizer_reply()
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply.  Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov 264048afab libceph: initialize last_linger_id with a large integer
osdc->last_linger_id is a counter for lreq->linger_id, which is used
for watch cookies.  Starting with a large integer should ease the task
of telling apart kernel and userspace clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-11-10 20:13:08 +01:00
Christoph Hellwig 9f08217120 ceph: don't include blk_types.h in messenger.h
The file only needs the struct bvec_iter delcaration, which is available
from bvec.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Yan, Zheng fcff415c94 ceph: handle CEPH_SESSION_REJECT message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:50 +02:00
Ilya Dryomov 005a07bf0a rbd: add 'client_addr' sysfs rbd device attribute
Export client addr/nonce, so userspace can check if a image is being
blacklisted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: ceph_client_addr(), endianess fix]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-08-24 23:49:16 +02:00
Ilya Dryomov 033268a5f0 libceph: rename ceph_client_id() -> ceph_client_gid()
It's gid / global_id in other places.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:16 +02:00
Douglas Fuller 6305a3b415 libceph: support for blacklisting clients
Reuse ceph_mon_generic_request infrastructure for sending monitor
commands.  In particular, add support for 'blacklist add' to prevent
other, non-responsive clients from making further updates.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller d4ed4a5305 libceph: support for lock.lock_info
Add an interface for the Ceph OSD lock.lock_info method and associated
data structures.

Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller f66241cb99 libceph: support for advisory locking on RADOS objects
This patch adds support for rados lock, unlock and break lock.

Based heavily on code by Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller 428a715811 libceph: add ceph_osdc_call() single-page helper
Add a convenience function to osd_client to send Ceph OSD
'class' ops. The interface assumes that the request and
reply data each consist of single pages.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Douglas Fuller a4ed38d7a1 libceph: support for CEPH_OSD_OP_LIST_WATCHERS
Add support for this Ceph OSD op, needed to support the RBD exclusive
lock feature.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
[idryomov@gmail.com: refactor, misc fixes throughout]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Ilya Dryomov f01d5cb24e libceph: rename ceph_entity_name_encode() -> ceph_auth_entity_name_encode()
Clear up EntityName vs entity_name_t confusion.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-24 23:49:15 +02:00
Arnd Bergmann a0f2b65275 ceph: fix symbol versioning for ceph_monc_do_statfs
The genksyms helper in the kernel cannot parse a type definition
like "typeof(((type *)0)->keyfld)" that is used in the DEFINE_RB_FUNCS
helper, causing the following EXPORT_SYMBOL() statement to be ignored
when computing the crcs, and triggering a warning about this:

WARNING: "ceph_monc_do_statfs" [fs/ceph/ceph.ko] has no CRC

To work around the problem, we can rewrite the type to reference
an undefined 'extern' symbol instead of a NULL pointer. This is
evidently ok for genksyms, and it no longer complains about the
line when calling it with 'genksyms -w'.

I've looked briefly into extending genksyms instead, but it seems
really hard to do. Jan Beulich introduced basic support for 'typeof'
a while ago in dc53324060 ("genksyms: fix typeof() handling"),
but that is not sufficient for the expression we have here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: fcd00b68bb ("libceph: DEFINE_RB_FUNCS macro")
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 15:32:53 +02:00
Yan, Zheng 0cabbd94ff libceph: fsmap.user subscription support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 03:00:40 +02:00
Yan, Zheng 774a6a118c ceph: reduce i_nr_by_mode array size
Track usage count for individual fmode bit. This can reduce the
array size by half.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:39 +02:00
Yan, Zheng 30c156d995 libceph: rados pool namespace support
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng 51e9273796 libceph: introduce reference counted string
The data structure is for storing namesapce string. It allows namespace
string to be shared between cephfs inodes with same layout. This data
structure can also be referenced by OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng 7627151ea3 libceph: define new ceph_file_layout structure
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:36 +02:00
Ilya Dryomov 22748f9d61 libceph: add start en/decoding block helpers
Add ceph_start_encoding() and ceph_start_decoding(), the equivalent of
ENCODE_START and DECODE_START in the userspace ceph code.

This is based on a patch from Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:36 +02:00
Ilya Dryomov 281dbe5db8 libceph: add an ONSTACK initializer for oids
An on-stack oid in ceph_ioctl_get_dataloc() is not initialized,
resulting in a WARN and a NULL pointer dereference later on.  We will
have more of these on-stack in the future, so fix it with a convenience
macro.

Fixes: d30291b985 ("libceph: variable-sized ceph_object_id")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:35 +02:00
Ilya Dryomov b2aa5d0bc8 libceph: fix some missing includes
- decode.h needs slab.h for kmalloc()
- osd_client.h needs msgpool.h for struct ceph_msgpool
- msgpool.h doesn't need messenger.h

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:35 +02:00
Ilya Dryomov b7ec35b304 libceph: change ceph_osdmap_flag() to take osdc
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-30 23:00:22 +02:00
Zhang Zhuoyu 3b33f692c8 ceph: make logical calculation functions return bool
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
2016-05-26 01:15:39 +02:00
Yan, Zheng f3c4ebe65e ceph: using hash value to compose dentry offset
If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry hash hash collision)

This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:36 +02:00
Yan, Zheng 956d39d631 ceph: define 'end/complete' in readdir reply as bit flags
Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:35 +02:00
Ilya Dryomov 737cc81ead libceph: support for subscribing to "mdsmap.<id>" maps
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:30 +02:00
Ilya Dryomov 7cca78c9dc libceph: replace ceph_monc_request_next_osdmap()
... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:30 +02:00
Ilya Dryomov 4609245e26 libceph: pool deletion detection
This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:29 +02:00
Ilya Dryomov d0b19705e9 libceph: async MON client generic requests
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:29 +02:00
Ilya Dryomov b07d3c4bd7 libceph: support for checking on status of watch
Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:28 +02:00
Ilya Dryomov 1907920324 libceph: support for sending notifies
Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:28 +02:00
Ilya Dryomov 922dab6134 libceph, rbd: ceph_osd_linger_request, watch/notify v2
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:15:02 +02:00
Ilya Dryomov 5aea3dcd50 libceph: a major OSD client update
This is a major sync up, up to ~Jewel.  The highlights are:

- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support

The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished.  This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:14:03 +02:00
Ilya Dryomov 9dd2845ccb libceph: protect osdc->osd_lru list with a spinlock
OSD client is getting moved from the big per-client lock to a set of
per-session locks.  The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:30 +02:00
Ilya Dryomov 42c1b12403 libceph: handle_one_map()
Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message.  This sets up the scene for the updated OSD
client.

Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.

We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 01:12:27 +02:00
Ilya Dryomov e5253a7bde libceph: allocate dummy osdmap in ceph_osdc_init()
This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:29 +02:00
Ilya Dryomov fe5da05e97 libceph: redo callbacks and factor out MOSDOpReply decoding
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:28 +02:00
Ilya Dryomov 85e084feb4 libceph: drop msg argument from ceph_osdc_callback_t
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success.  This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov bb873b5391 libceph: switch to calc_target(), part 2
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target.  Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded.  If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op().  While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov a66dd38309 libceph: switch to calc_target(), part 1
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.

ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid.  This is fixed in
"libceph: switch to calc_target(), part 2".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov 63244fa123 libceph: introduce ceph_osd_request_target, calc_target()
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov 04812acf57 libceph: pi->min_size, pi->last_force_request_resend
Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov f984cb76cc libceph: make pgid_cmp() global
calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov f81f16339a libceph: rename ceph_calc_pg_primary()
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov 6f3bfd45cd libceph: ceph_osds, ceph_pg_to_up_acting_osds()
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov d9591f5e28 libceph: rename ceph_oloc_oid_to_pg()
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:24 +02:00
Ilya Dryomov 985c167388 libceph: fix ceph_eversion encoding
eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request().  Reoder ceph_eversion fields.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:24 +02:00
Ilya Dryomov fcd00b68bb libceph: DEFINE_RB_FUNCS macro
Given

    struct foo {
        u64 id;
        struct rb_node bar_node;
    };

generate insert_bar(), erase_bar() and lookup_bar() functions with

    DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)

The key is assumed to be an integer (u64, int, etc), compared with
< and >.  nodefld has to be initialized with RB_CLEAR_NODE().

Start using it for MDS, MON and OSD requests and OSD sessions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov 0c0a8de13f libceph: nuke unused fields and functions
Either unused or useless:

    osdmap->mkfs_epoch
    osd->o_marked_for_keepalive
    monc->num_generic_requests
    osdc->map_waiters
    osdc->last_requested_map
    osdc->timeout_tid

    osd_req_op_cls_response_data()

    osdmap_apply_incremental() @msgr arg

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov d30291b985 libceph: variable-sized ceph_object_id
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
expect one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:22 +02:00
Ilya Dryomov 13d1ad16d0 libceph: move message allocation out of ceph_osdc_alloc_request()
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed.  Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:21 +02:00
Ilya Dryomov 6c1ea260f8 libceph: make authorizer destruction independent of ceph_auth_client
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
    [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
    [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
    [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
    [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
    [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
    [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-04-25 20:54:13 +02:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Yan, Zheng 315f240880 ceph: fix security xattr deadlock
When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.

The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 2c63f49a72 libceph: add helper that duplicates last extent operation
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov 3f1af42ad0 libceph: enable large, variable-sized OSD requests
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Yan, Zheng 7665d85b73 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Ilya Dryomov de2aa102ea libceph: rename ceph_osd_req_op::payload_len to indata_len
Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:41 +01:00
Ilya Dryomov 168b9090c7 libceph: monc hunt rate is 3s with backoff up to 30s
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov 58d81b1294 libceph: monc ping rate is 10s
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov 82dcabad75 libceph: revamp subs code, switch to SUBSCRIBE2 protocol
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime".  To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs.  Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Yan, Zheng 5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Ilya Dryomov b0b31a8ffe libceph: MOSDOpReply v7 encoding
Empty request_redirect_t (struct ceph_request_redirect in the kernel
client) is now encoded with a bool.  NEW_OSDOPREPLY_ENCODING feature
bit overlaps with already supported CRUSH_TUNABLES5.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:26:08 +01:00
Ilya Dryomov 97db9a8818 libceph: advertise support for TUNABLES5
Add TUNABLES5 feature (chooseleaf_stable tunable) to a set of features
supported by default.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04 18:26:04 +01:00
Ilya Dryomov 67645d7619 libceph: fix ceph_msg_revoke()
There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip.  However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle.  If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion.  The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes.  This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong.  If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke.  Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial.  The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called.  That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs.  Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.

Cc: stable@vger.kernel.org # 4.0+
Reported-by: Matt Conner <matt.conner@keepertech.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Matt Conner <matt.conner@keepertech.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21 19:36:08 +01:00
Yaowei Bai 79a3ed2e98 ceph: ceph_frag_contains_value can be boolean
This patch makes ceph_frag_contains_value return bool to improve
readability due to this particular function only using either one or
zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:07 +01:00
Yaowei Bai eade1fe75f ceph: remove unused functions in ceph_frag.h
These functions were introduced in commit 3d14c5d2b ("ceph: factor
out libceph from Ceph file system"). Howover, there's no user of
these functions since then, so remove them for simplicity.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:07 +01:00
Ilya Dryomov a51983e4dd libceph: add nocephx_sign_messages option
Support for message signing was merged into 3.19, along with
nocephx_require_signatures option.  But, all that option does is allow
the kernel client to talk to clusters that don't support MSG_AUTH
feature bit.  That's pretty useless, given that it's been supported
since bobtail.

Meanwhile, if one disables message signing on the server side with
"cephx sign messages = false", it becomes impossible to use the kernel
client since it expects messages to be signed if MSG_AUTH was
negotiated.  Add nocephx_sign_messages option to support this use case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov 859bff51dc libceph: stop duplicating client fields in messenger
supported_features and required_features serve no purpose at all, while
nocrc and tcp_nodelay belong to ceph_options::flags.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:46 +01:00
Ilya Dryomov 79dbd1baa6 libceph: msg signing callouts don't need con argument
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:45 +01:00
Ilya Dryomov 335c258582 libceph: advertise support for keepalive2
We are the client, but advertise keepalive2 anyway - for consistency,
if nothing else.  In the future the server might want to know whether
its clients support keepalive2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-09-17 20:14:27 +03:00
Ilya Dryomov 7f61f54565 libceph: don't access invalid memory in keepalive2 path
This

    struct ceph_timespec ceph_ts;
    ...
    con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive().  As
a result, we send out bogus keepalive2 stamps.  Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-09-17 20:14:15 +03:00
Yan, Zheng 8b9558aab8 libceph: use keepalive2 to verify the mon session is alive
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-09-08 23:14:30 +03:00
Ilya Dryomov 757856d2b9 libceph: enable ceph in a non-default network namespace
Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net.  Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.

This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-07-09 20:30:34 +03:00
Yan, Zheng f66fd9f095 ceph: pre-allocate data structure that tracks caps flushing
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:31 +03:00
Ilya Dryomov a319bf56a6 libceph: store timeouts in jiffies, verify user input
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.

There is also a bug in the way mount_timeout=0 is handled.  It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.

Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:29 +03:00
Ilya Dryomov d50c97b566 libceph: nuke time_sub()
Unused since ceph got merged into mainline I guess.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:29 +03:00
Yan, Zheng 144cba1493 libceph: allow setting osd_req_op's flags
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:27 +03:00
Ilya Dryomov 7c1c4747f2 libceph: announce support for straw2 buckets
Sync up feature bits and enable CEPH_FEATURE_CRUSH_V4.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22 18:33:48 +03:00
Yan, Zheng 0ea611a3bc ceph: rename snapshot support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22 18:33:41 +03:00
Ilya Dryomov 9571eb4f96 libceph: simplify our debugfs attr macro
No need to do single_open()'s job ourselves.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:39 +03:00
Ilya Dryomov 5cf7bd3012 libceph: expose client options through debugfs
Add a client_options attribute for showing libceph options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:39 +03:00
Ilya Dryomov ff40f9ae95 libceph, ceph: split ceph_show_options()
Split ceph_show_options() into two pieces and move the piece
responsible for printing client (libceph) options into net/ceph.  This
way people adding a libceph option wouldn't have to remember to update
code in fs/ceph.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:38 +03:00
Joe Perches 3ef650d398 libceph: osdmap.h: Add missing format newlines
To avoid possible interleaving, add missing '\n' to formats.

Convert pr_warning to pr_warn while there.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:35 +03:00
Chaitanya Huilgol ba988f87f5 libceph: tcp_nodelay support
TCP_NODELAY socket option set on connection sockets,
disables Nagle’s algorithm and improves latency characteristics.
tcp_nodelay(default)/notcp_nodelay option flags provided to
enable/disable setting the socket option.

Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com>
[idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:40 +03:00
Yan, Zheng 03f4fcb028 ceph: handle SESSION_FORCE_RO message
mark session as readonly and wake up all cap waiters.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:37 +03:00
Ilya Dryomov 7a6fdeb2b1 libceph: nuke pool op infrastructure
On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 22 Dec 2014, Ilya Dryomov wrote:
>> Actually, pool op stuff has been unused for over two years - looks like
>> it was added for rbd create_snap and that got ripped out in 2012.  It's
>> unlikely we'd ever need to manage pools or snaps from the kernel client
>> so I think it makes sense to nuke it.  Sage?
>
> Yep!

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:37 +03:00
Ilya Dryomov d7d5a007b1 libceph: fix sparse endianness warnings
The only real issue is the one in auth_x.c and it came with
3.19-rc1 merge.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-01-08 20:36:57 +03:00
Ilya Dryomov 84a1d2d1ec libceph: fixup includes in pagelist.h
pagelist.h needs to include linux/types.h and asm/byteorder.h and not
rely on other headers pulling yet another set of headers.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:53 +03:00
Yan, Zheng 01deead041 ceph: use getattr request to fetch inline data
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data
in getattr reply will be copied to the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng 31c542a199 ceph: add inline data to pagecache
Request reply and cap message can contain inline data. add inline data
to the page cache if there is Fc cap.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng 715e4cd405 libceph: specify position of extent operation
allow specifying position of extent operation in multi-operations
osd request. This is required for cephfs to convert inline data to
normal data (compare xattr, then write object).

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng d74b50bed0 libceph: add SETXATTR/CMPXATTR osd operations support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:51 +03:00
Yan, Zheng a3fc98005c libceph: require cephx message signature by default
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:51 +03:00
John Spray d4e1a4e0db libceph: update ceph_msg_header structure
2 bytes of what was reserved space is now used by userspace for the
compat_version field.

Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17 20:09:50 +03:00
Yan, Zheng 33d0733796 libceph: message signature support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:50 +03:00
Ilya Dryomov 4965fc38c4 libceph: nuke ceph_kvfree()
Use kvfree() from linux/mm.h instead, which is identical.  Also fix the
ceph_buffer comment: we will allocate with kmalloc() up to 32k - the
value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an
implementation detail so don't mention it at all.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:50 +03:00
Yan, Zheng 9280be24dc ceph: fix file lock interruption
When a lock operation is interrupted, current code sends a unlock request to
MDS to undo the lock operation. This method does not work as expected because
the unlock request can drop locks that have already been acquired.

The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR
requests to interrupt blocked file lock request. These requests do not drop
locks that have alread been acquired, they only interrupt blocked file lock
request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:49 +03:00
Ilya Dryomov 70b5bfa360 libceph: sync osd op definitions in rados.h
Bring in missing osd ops and strings, use macros to eliminate multiple
points of maintenance.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:57:02 -07:00
Fabian Frederick eb179d3975 libceph: remove redundant declaration
ceph_release_page_vector was defined twice in libceph.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-10-14 12:57:02 -07:00
Yan, Zheng e4339d28f6 libceph: reference counting pagelist
this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:56:48 -07:00
Ilya Dryomov 2d05f082cb libceph: nuke ceph_osdc_unregister_linger_request()
Remove now unused ceph_osdc_unregister_linger_request().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:45 +04:00
Ilya Dryomov c9f9b93ddf libceph: introduce ceph_osdc_cancel_request()
Introduce ceph_osdc_cancel_request() intended for canceling requests
from the higher layers (rbd and cephfs).  Because higher layers are in
charge and are supposed to know what and when they are canceling, the
request is not completed, only unref'ed and removed from the libceph
data structures.

__cancel_request() is no longer called before __unregister_request(),
because __unregister_request() unconditionally revokes r_request and
there is no point in trying to do it twice.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:44 +04:00
Ilya Dryomov 9e94af202a libceph: move and add dout()s to ceph_osdc_request_{get,put}()
Add dout()s to ceph_osdc_request_{get,put}().  Also move them to .c and
turn kref release callback into a static function.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:43 +04:00
Ilya Dryomov 0215e44bb3 libceph: move and add dout()s to ceph_msg_{get,put}()
Add dout()s to ceph_msg_{get,put}().  Also move them to .c and turn
kref release callback into a static function.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:43 +04:00
Ilya Dryomov 1d0326b13b libceph: rename ceph_osd_request::r_linger_osd to r_linger_osd_item
So that:

req->r_osd_item --> osd->o_requests list
req->r_linger_osd_item --> osd->o_linger_requests list

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-08 15:08:42 +04:00
Linus Torvalds 6d87c225f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "This has a mix of bug fixes and cleanups.

  Alex's patch fixes a rare race in RBD.  Ilya's patches fix an ENOENT
  check when a second rbd image is mapped and a couple memory leaks.
  Zheng fixes several issues with fragmented directories and multiple
  MDSs.  Josh fixes a spin/sleep issue, and Josh and Guangliang's
  patches fix setting and unsetting RBD images read-only.

  Naturally there are several other cleanups mixed in for good measure"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  rbd: only set disk to read-only once
  rbd: move calls that may sleep out of spin lock range
  rbd: add ioctl for rbd
  ceph: use truncate_pagecache() instead of truncate_inode_pages()
  ceph: include time stamp in every MDS request
  rbd: fix ida/idr memory leak
  rbd: use reference counts for image requests
  rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
  rbd: make sure we have latest osdmap on 'rbd map'
  libceph: add ceph_monc_wait_osdmap()
  libceph: mon_get_version request infrastructure
  libceph: recognize poolop requests in debugfs
  ceph: refactor readpage_nounlock() to make the logic clearer
  mds: check cap ID when handling cap export message
  ceph: remember subtree root dirfrag's auth MDS
  ceph: introduce ceph_fill_fragtree()
  ceph: handle cap import atomically
  ceph: pre-allocate ceph_cap struct for ceph_add_cap()
  ceph: update inode fields according to issued caps
  rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  ...
2014-06-12 23:06:23 -07:00
Ilya Dryomov 6044cde6f2 libceph: add ceph_monc_wait_osdmap()
Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
specified epoch is received or timeout occurs.

Export both of these as they are going to be needed by rbd.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-06-06 09:29:57 +08:00
Ilya Dryomov 513a8243d6 libceph: mon_get_version request infrastructure
Add support for mon_get_version requests to libceph.  This reuses much
of the ceph_mon_generic_request infrastructure, with one exception.
Older OSDs don't set mon_get_version reply hdr->tid even if the
original request had a non-zero tid, which makes it impossible to
lookup ceph_mon_generic_request contexts by tid in get_generic_reply()
for such replies.  As a workaround, we allocate a reply message on the
reply path.  This can probably interfere with revoke, but I don't see
a better way.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-06-06 09:29:57 +08:00
Yan, Zheng f98a128a55 ceph: update inode fields according to issued caps
Cap message and request reply from non-auth MDS may carry stale
information (corresponding locks are in LOCK states) even they
have the newest inode version. So client should update inode fields
according to issued caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06 09:29:52 +08:00
Al Viro 2b777c9dd9 ceph_sync_read: stop poking into iov_iter guts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:42 -04:00
Ilya Dryomov 18cb95af2d libceph: enable PRIMARY_AFFINITY feature bit
Announce our support for osdmaps with non-default primary affinity
values.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:20 -07:00
Ilya Dryomov 8008ab1080 libceph: return primary from ceph_calc_pg_acting()
In preparation for adding support for primary_temp, stop assuming
primaryness: add a primary out parameter to ceph_calc_pg_acting() and
change call sites accordingly.  Primary is now specified separately
from the order of osds in the set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:14 -07:00
Ilya Dryomov ac972230e2 libceph: switch ceph_calc_pg_acting() to new helpers
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(),
raw_to_up_osds() and apply_temps().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:13 -07:00
Ilya Dryomov 2abebdbca7 libceph: ceph_can_shift_osds(pool) and pool type defines
Bring in pg_pool_t::can_shift_osds() counterpart along with pool type
defines.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:08 -07:00
Ilya Dryomov 246138fa67 libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
Sync up with ceph.git definitions.  Bring in ceph_osd_is_down().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:07 -07:00
Ilya Dryomov ddf3a21a03 libceph: enable OSDMAP_ENC feature bit
Announce our support for "new" (v7 - split and separately versioned
client and osd sections) osdmap enconding.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:05 -07:00
Ilya Dryomov 2cfa34f2d6 libceph: primary_affinity infrastructure
Add primary_affinity infrastructure.  primary_affinity values are
stored in an max_osd-sized array, hanging off ceph_osdmap, similar to
a osd_weight array.

Introduce {get,set}_primary_affinity() helpers, primarily to return
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to
abstract out osd_primary_affinity array allocation and initialization.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:08:02 -07:00
Ilya Dryomov 9686f94c8c libceph: primary_temp infrastructure
Add primary_temp mappings infrastructure.  struct ceph_pg_mapping is
overloaded, primary_temp mappings are stored in an rb-tree, rooted at
ceph_osdmap, in a manner similar to pg_temp mappings.

Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap,
one 'primary_temp <pgid> <osd>' per line, e.g:

    primary_temp 2.6 4

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:58 -07:00
Ilya Dryomov 35a935d75d libceph: generalize ceph_pg_mapping
In preparation for adding support for primary_temp mappings, generalize
struct ceph_pg_mapping so it can hold mappings other than pg_temp.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:57 -07:00
Ilya Dryomov a2505d63ee libceph: split osdmap allocation and decode steps
Split osdmap allocation and initialization into a separate function,
ceph_osdmap_decode().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-04 21:07:37 -07:00
Ilya Dryomov 07bd7de47a crush: support chooseleaf_vary_r tunable (tunables3) by default
Add TUNABLES3 feature (chooseleaf_vary_r tunable) to a set of features
supported by default.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-04-04 21:07:29 -07:00
Yan, Zheng eb13e832f8 ceph: use fl->fl_file as owner identifier of flock and posix lock
flock and posix lock should use fl->fl_file instead of process ID
as owner identifier. (posix lock uses fl->fl_owner. fl->fl_owner
is usually equal to fl->fl_file, but it also can be a customized
value). The process ID of who holds the lock is just for F_GETLK
fcntl(2).

The fix is rename the 'pid' fields of struct ceph_mds_request_args
and struct ceph_filelock to 'owner', rename 'pid_namespace' fields
to 'pid'. Assign fl->fl_file to the 'owner' field of lock messages.
We also set the most significant bit of the 'owner' field. MDS can
use that bit to distinguish between old and new clients.

The MDS counterpart of this patch modifies the flock code to not
take the 'pid_namespace' into consideration when checking conflict
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-04 21:07:11 -07:00
Yan, Zheng 19913b4eac ceph: add get_name() NFS export callback
Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:53 +08:00
Ilya Dryomov 7cc69d42e6 libceph: bump CEPH_OSD_MAX_OP to 3
Our longest osd request now contains 3 ops: copyup+hint+write.

Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2.  Fix it, and switch to rbd_assert while at it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:52 +08:00
Ilya Dryomov c647b8a8c6 libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
This is primarily for rbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:51 +08:00
Ilya Dryomov 7b25bf5f02 libceph: encode CEPH_OSD_OP_FLAG_* op flags
Encode ceph_osd_op::flags field so that it gets sent over the wire.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-04-03 10:33:51 +08:00
Ilya Dryomov 9d521470a4 libceph: a per-osdc crush scratch buffer
With the addition of erasure coding support in the future, scratch
variable-length array in crush_do_rule_ary() is going to grow to at
least 200 bytes on average, on top of another 128 bytes consumed by
rawosd/osd arrays in the call chain.  Replace it with a buffer inside
struct osdmap and a mutex.  This shouldn't result in any contention,
because all osd requests were already serialized by request_mutex at
that point; the only unlocked caller was ceph_ioctl_get_dataloc().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-04-03 10:33:50 +08:00
Yan, Zheng bcdfeb2eb4 ceph: remove xattr when null value is given to setxattr()
For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE
to distinguish null value case from the zero-length value case.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-02-17 12:37:09 -08:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds d891ea23d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This is a big batch.  From Ilya we have:

   - rbd support for more than ~250 mapped devices (now uses same scheme
     that SCSI does for device major/minor numbering)
   - crush updates for new mapping behaviors (will be needed for coming
     erasure coding support, among other things)
   - preliminary support for tiered storage pools

  There is also a big series fixing a pile cephfs bugs with clustered
  MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
  fscache improvements from Li Wang, improved behavior when we get
  ENOSPC from Josh Durgin, some readv/writev improvements from
  Majianpeng, and the usual mix of small cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
  ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
  ceph: fix dout() compile warnings in ceph_filemap_fault()
  libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
  libceph: follow redirect replies from osds
  libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
  libceph: follow {read,write}_tier fields on osd request submission
  libceph: add ceph_pg_pool_by_id()
  libceph: CEPH_OSD_FLAG_* enum update
  libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
  libceph: introduce and start using oid abstraction
  libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
  libceph: move ceph_file_layout helpers to ceph_fs.h
  libceph: start using oloc abstraction
  libceph: dout() is missing a newline
  libceph: add ceph_kv{malloc,free}() and switch to them
  libceph: support CEPH_FEATURE_EXPORT_PEER
  ceph: add imported caps when handling cap export message
  ceph: add open export target session helper
  ceph: remove exported caps when handling cap import message
  ceph: handle session flush message
  ...
2014-01-28 11:02:23 -08:00
Ilya Dryomov 80e163a58c libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
Announce our (limited, see previous commit) support for CACHEPOOL
feature.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:57 +02:00
Ilya Dryomov 205ee1187a libceph: follow redirect replies from osds
Follow redirect replies from osds, for details see ceph.git commit
fbbe3ad1220799b7bb00ea30fce581c5eadaf034.

v1 (current) version of redirect reply consists of oloc and oid, which
expands to pool, key, nspace, hash and oid.  However, server-side code
that would populate anything other than pool doesn't exist yet, and
hence this commit adds support for pool redirects only.  To make sure
that future server-side updates don't break us, we decode all fields
and, if any of key, nspace, hash or oid have a non-default value, error
out with "corrupt osd_op_reply ..." message.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:53 +02:00
Ilya Dryomov 3c972c95c6 libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:49 +02:00
Ilya Dryomov 17a13e4028 libceph: follow {read,write}_tier fields on osd request submission
Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and
write_tier for write and read+write ops (aka basic tiering support).
{read,write}_tier are part of pg_pool_t since v9.  This commit bumps
our pg_pool_t decode compat version from v7 to v9, all new fields
except for {read,write}_tier are ignored.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:45 +02:00
Ilya Dryomov ce7f6a2790 libceph: add ceph_pg_pool_by_id()
"Lookup pool info by ID" function is hidden in osdmap.c.  Expose it to
the rest of libceph.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:40 +02:00
Ilya Dryomov 1b3f2ab510 libceph: CEPH_OSD_FLAG_* enum update
Update CEPH_OSD_FLAG_* enum.  (We need CEPH_OSD_FLAG_IGNORE_OVERLAY to
support tiering).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:36 +02:00
Ilya Dryomov 7c13cb6435 libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:32 +02:00
Ilya Dryomov 4295f2217a libceph: introduce and start using oid abstraction
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:28 +02:00
Ilya Dryomov 2d0ebc5d59 libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:24 +02:00
Ilya Dryomov e8221464fc libceph: move ceph_file_layout helpers to ceph_fs.h
Move ceph_file_layout helper macros and inline functions to ceph_fs.h.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:20 +02:00
Ilya Dryomov 22116525ba libceph: start using oloc abstraction
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction.  Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool.  This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:03 +02:00
Ilya Dryomov eeb0bed557 libceph: add ceph_kv{malloc,free}() and switch to them
Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
  and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
  bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
  memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-26 12:34:23 +02:00
Alex Elder 04f9b74e4d remove extra definitions of U32_MAX
Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Sage Weil <sage@inktank.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:55 -08:00
Alex Elder 77719536dc conditionally define U32_MAX
The symbol U32_MAX is defined in several spots.  Change these
definitions to be conditional.  This is in preparation for the next
patch, which centralizes the definition in <linux/kernel.h>.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:54 -08:00
Yan, Zheng 80213a84a9 libceph: support CEPH_FEATURE_EXPORT_PEER
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:31 +08:00
Yan, Zheng 4ee6a914ed ceph: remove exported caps when handling cap import message
Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.

We remove the exported caps because they are stale, keeping them
can compromise consistence.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:28 +08:00
Yan, Zheng 186e4f7a4b ceph: handle session flush message
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 13:29:33 +08:00
Ilya Dryomov 3cea4c3071 libceph: rename ceph_msg::front_max to front_alloc_len
Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-14 11:27:26 +02:00
Ilya Dryomov cdff49918c crush: support new indep mode and SET_* steps (crush v2) by default
Add CRUSH_V2 feature (new indep mode and SET_* steps) to a set of
features supported by default.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:29 +02:00
Ilya Dryomov 2b3e0c905a libceph: update ceph_features.h
This updates ceph_features.h so that it has all feature bits defined in
ceph.git.  In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly.  The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:09 +02:00
Ilya Dryomov 12b4629a9f libceph: all features fields must be u64
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64.  (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-31 20:32:08 +02:00
Josh Durgin d29adb34a9 libceph: block I/O when PAUSE or FULL osd map flags are set
The PAUSEWR and PAUSERD flags are meant to stop the cluster from
processing writes and reads, respectively. The FULL flag is set when
the cluster determines that it is out of space, and will no longer
process writes.  PAUSEWR and PAUSERD are purely client-side settings
already implemented in userspace clients. The osd does nothing special
with these flags.

When the FULL flag is set, however, the osd responds to all writes
with -ENOSPC. For cephfs, this makes sense, but for rbd the block
layer translates this into EIO.  If a cluster goes from full to
non-full quickly, a filesystem on top of rbd will not behave well,
since some writes succeed while others get EIO.

Fix this by blocking any writes when the FULL flag is set in the osd
client. This is the same strategy used by userspace, so apply it by
default.  A follow-on patch makes this configurable.

__map_request() is called to re-target osd requests in case the
available osds changed.  Add a paused field to a ceph_osd_request, and
set it whenever an appropriate osd map flag is set.  Avoid queueing
paused requests in __map_request(), but force them to be resent if
they become unpaused.

Also subscribe to the next osd map from the monitor if any of these
flags are set, so paused requests can be unblocked as soon as
possible.

Fixes: http://tracker.ceph.com/issues/6079

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-13 09:13:29 -08:00
Kent Overstreet f38a5181d9 ceph: Convert to immutable biovecs
Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-11-23 22:33:52 -08:00
Josh Durgin dd935f44a4 libceph: add function to ensure notifies are complete
Without a way to flush the osd client's notify workqueue, a watch
event that is unregistered could continue receiving callbacks
indefinitely.

Unregistering the event simply means no new notifies are added to the
queue, but there may still be events in the queue that will call the
watch callback for the event. If the queue is flushed after the event
is unregistered, the caller can be sure no more watch callbacks will
occur for the canceled watch.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2013-09-09 11:15:49 -07:00
Josh Durgin 8b8cf8917f libceph: fix invalid unsigned->signed conversion for timespec encoding
__kernel_time_t is a long, which cannot hold a U32_MAX on 32-bit
architectures.  Just drop this check as it has limited value.

This fixes a crash like:

[  957.905812] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/include/linux/ceph/decode.h:164!
[  957.914849] Internal error: Oops - BUG: 0 [#1] SMP ARM
[  957.919978] Modules linked in: rbd libceph libcrc32c ipmi_devintf ipmi_si ipmi_msghandler nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc
[  957.932547] CPU: 1    Tainted: G        W     (3.9.0-ceph-19bb6a83-highbank #1)
[  957.939881] PC is at ceph_osdc_build_request+0x8c/0x4f8 [libceph]
[  957.945967] LR is at 0xec520904
[  957.949103] pc : [<bf13e76c>]    lr : [<ec520904>]    psr: 20000153
[  957.949103] sp : ec753df8  ip : 00000001  fp : ec53e100
[  957.960571] r10: ebef25c0  r9 : ec5fa400  r8 : ecbcc000
[  957.965788] r7 : 00000000  r6 : 00000000  r5 : ffffffff  r4 : 00000020
[  957.972307] r3 : 51cc8143  r2 : ec520900  r1 : ec753e58  r0 : ec520908
[  957.978827] Flags: nzCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment user
[  957.986039] Control: 10c5387d  Table: 2c59c04a  DAC: 00000015
[  957.991777] Process rbd (pid: 2138, stack limit = 0xec752238)
[  957.997514] Stack: (0xec753df8 to 0xec754000)
[  958.001864] 3de0:                                                       00000001 00000001
[  958.010032] 3e00: 00000001 bf139744 ecbcc000 ec55a0a0 00000024 00000000 ebef25c0 fffffffe
[  958.018204] 3e20: ffffffff 00000000 00000000 00000001 ec5fa400 ebef25c0 ec53e100 bf166b68
[  958.026377] 3e40: 00000000 0000220f fffffffe ffffffff ec753e58 bf13ff24 51cc8143 05b25ed2
[  958.034548] 3e60: 00000001 00000000 00000000 bf1688d4 00000001 00000000 00000000 00000000
[  958.042720] 3e80: 00000001 00000060 ec5fa400 ed53d200 ed439600 ed439300 00000001 00000060
[  958.050888] 3ea0: ec5fa400 ed53d200 00000000 bf16a320 00000000 ec53e100 00000040 ec753eb8
[  958.059059] 3ec0: ec51df00 ed53d7c0 ed53d200 ed53d7c0 00000000 ed53d7c0 ec5fa400 bf16ed70
[  958.067230] 3ee0: 00000000 00000060 00000002 ed53d200 00000000 bf16acf4 ed53d7c0 ec752000
[  958.075402] 3f00: ed980e50 e954f5d8 00000000 00000060 ed53d240 ed53d258 ec753f80 c04f44a8
[  958.083574] 3f20: edb7910c ec664700 01ade920 c02e4c44 00000060 c016b3dc ec51de40 01adfb84
[  958.091745] 3f40: 00000060 ec752000 ec753f80 ec752000 00000060 c0108444 00000007 ec51de48
[  958.099914] 3f60: ed0eb8c0 00000000 00000000 ec51de40 01adfb84 00000001 00000060 c0108858
[  958.108085] 3f80: 00000000 00000000 51cc8143 00000060 01adfb84 00000007 00000004 c000dd68
[  958.116257] 3fa0: 00000000 c000dbc0 00000060 01adfb84 00000007 01adfb84 00000060 01adfb80
[  958.124429] 3fc0: 00000060 01adfb84 00000007 00000004 beded1a8 00000000 01adf2f0 01ade920
[  958.132599] 3fe0: 00000000 beded180 b6811324 b6811334 800f0010 00000007 2e7f5821 2e7f5c21
[  958.140815] [<bf13e76c>] (ceph_osdc_build_request+0x8c/0x4f8 [libceph]) from [<bf166b68>] (rbd_osd_req_format_write+0x50/0x7c [rbd])
[  958.152739] [<bf166b68>] (rbd_osd_req_format_write+0x50/0x7c [rbd]) from [<bf1688d4>] (rbd_dev_header_watch_sync+0xe0/0x204 [rbd])
[  958.164486] [<bf1688d4>] (rbd_dev_header_watch_sync+0xe0/0x204 [rbd]) from [<bf16a320>] (rbd_dev_image_probe+0x23c/0x850 [rbd])
[  958.175967] [<bf16a320>] (rbd_dev_image_probe+0x23c/0x850 [rbd]) from [<bf16acf4>] (rbd_add+0x3c0/0x918 [rbd])
[  958.185975] [<bf16acf4>] (rbd_add+0x3c0/0x918 [rbd]) from [<c02e4c44>] (bus_attr_store+0x20/0x2c)
[  958.194850] [<c02e4c44>] (bus_attr_store+0x20/0x2c) from [<c016b3dc>] (sysfs_write_file+0x168/0x198)
[  958.203984] [<c016b3dc>] (sysfs_write_file+0x168/0x198) from [<c0108444>] (vfs_write+0x9c/0x170)
[  958.212768] [<c0108444>] (vfs_write+0x9c/0x170) from [<c0108858>] (sys_write+0x3c/0x70)
[  958.220768] [<c0108858>] (sys_write+0x3c/0x70) from [<c000dbc0>] (ret_fast_syscall+0x0/0x30)
[  958.229199] Code: e59d1058 e5913000 e3530000 ba000114 (e7f001f2)

CC: stable@vger.kernel.org  # 3.4+
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-09 12:18:06 -07:00
Yan, Zheng eb845ff13a libceph: fix safe completion
handle_reply() calls complete_request() only if the first OSD reply
has ONDISK flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-03 15:32:44 -07:00
Alex Elder 5522ae0b68 libceph: use slab cache for osd client requests
Create a slab cache to manage allocation of ceph_osdc_request
structures.

This resolves:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:41 -05:00
Alex Elder 4f0dcb10cf libceph: create source file "net/ceph/snapshot.c"
This creates a new source file "net/ceph/snapshot.c" to contain
utility routines related to ceph snapshot contexts.  The main
motivation was to define ceph_create_snap_context() as a common way
to create these structures, but I've moved the definitions of
ceph_get_snap_context() and ceph_put_snap_context() there too.
(The benefit of inlining those is very small, and I'd rather
keep this collection of functions together.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:08 -07:00
Alex Elder c3f56102f2 libceph: validate timespec conversions
A ceph timespec contains 32-bit unsigned values for its seconds and
nanoseconds components.  For a standard timespec, both fields are
signed, and the seconds field is almost surely 64 bits.

Add some explicit casts so the fact that this conversion is taking
place is obvious.  Also trip a bug if we ever try to put out of
range (negative or too big) values into a ceph timespec.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:17 -07:00
Alex Elder b587398a4f libceph: add signed type limits
Flesh out the limits defined in <linux/ceph/decode.h> to include the
maximum and minimum values for signed type S8, S16, S32, and S64.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:16 -07:00
Alex Elder 6c57b5545d libceph: support pages for class request data
Add the ability to provide an array of pages as outbound request
data for object class method calls.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:06 -07:00
Alex Elder 49719778bf libceph: support raw data requests
Allow osd request ops that aren't otherwise structured (not class,
extent, or watch ops) to specify "raw" data to be used to hold
incoming data for the op.  Make use of this capability for the osd
STAT op.

Prefix the name of the private function osd_req_op_init() with "_",
and expose a new function by that (earlier) name whose purpose is to
initialize osd ops with (only) implied data.

For now we'll just support the use of a page array for an osd op
with incoming raw data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:00 -07:00
Alex Elder 406e2c9f92 libceph: kill off osd data write_request parameters
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data.  Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:58 -07:00
Alex Elder 26be88087a libceph: change how "safe" callback is used
An osd request currently has two callbacks.  They inform the
initiator of the request when we've received confirmation for the
target osd that a request was received, and when the osd indicates
all changes described by the request are durable.

The only time the second callback is used is in the ceph file system
for a synchronous write.  There's a race that makes some handling of
this case unsafe.  This patch addresses this problem.  The error
handling for this callback is also kind of gross, and this patch
changes that as well.

In ceph_sync_write(), if a safe callback is requested we want to add
the request on the ceph inode's unsafe items list.  Because items on
this list must have their tid set (by ceph_osd_start_request()), the
request added *after* the call to that function returns.  The
problem with this is that there's a race between starting the
request and adding it to the unsafe items list; the request may
already be complete before ceph_sync_write() even begins to put it
on the list.

To address this, we change the way the "safe" callback is used.
Rather than just calling it when the request is "safe", we use it to
notify the initiator the bounds (start and end) of the period during
which the request is *unsafe*.  So the initiator gets notified just
before the request gets sent to the osd (when it is "unsafe"), and
again when it's known the results are durable (it's no longer
unsafe).  The first call will get made in __send_request(), just
before the request message gets sent to the messenger for the first
time.  That function is only called by __send_queued(), which is
always called with the osd client's request mutex held.

We then have this callback function insert the request on the ceph
inode's unsafe list when we're told the request is unsafe.  This
will avoid the race because this call will be made under protection
of the osd client's request mutex.  It also nicely groups the setup
and cleanup of the state associated with managing unsafe requests.

The name of the "safe" callback field is changed to "unsafe" to
better reflect its new purpose.  It has a Boolean "unsafe" parameter
to indicate whether the request is becoming unsafe or is now safe.
Because the "msg" parameter wasn't used, we drop that.

This resolves the original problem reportedin:
    http://tracker.ceph.com/issues/4706

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 21:18:52 -07:00
Alex Elder 04017e29bb libceph: make method call data be a separate data item
Right now the data for a method call is specified via a pointer and
length, and it's copied--along with the class and method name--into
a pagelist data item to be sent to the osd.  Instead, encode the
data in a data item separate from the class and method names.

This will allow large amounts of data to be supplied to methods
without copying.  Only rbd uses the class functionality right now,
and when it really needs this it will probably need to use a page
array rather than a page list.  But this simple implementation
demonstrates the functionality on the osd client, and that's enough
for now.

This resolves:
    http://tracker.ceph.com/issues/4104

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:35 -07:00
Alex Elder 90af36022a libceph: add, don't set data for a message
Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing.  Currently only one data item is
ever added to a message, but that's about to change.

This resolves:
    http://tracker.ceph.com/issues/2770

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:34 -07:00
Alex Elder ca8b3a6917 libceph: implement multiple data items in a message
This patch adds support to the messenger for more than one data item
in its data list.

A message data cursor has two more fields to support this:
    - a count of the number of bytes left to be consumed across
      all data items in the list, "total_resid"
    - a pointer to the head of the list (for validation only)

The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.

When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided.  The data pointer
is initialized to the first data item on the list.  From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.

Advancing the cursor consumes bytes in total_resid.  If the resid
field reaches zero, it means the current data item is fully
consumed.  If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that.  (A check is made at
this point to make sure we don't wrap around the front of the list.)

The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support.  The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.

When total_resid reaches zero, we're done.

This resolves:
    http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:33 -07:00
Alex Elder 5240d9f95d libceph: replace message data pointer with list
In place of the message data pointer, use a list head which links
through message data items.  For now we only support a single entry
on that list.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:32 -07:00
Alex Elder 8ae4f4f5c0 libceph: have cursor point to data
Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item.  This will
allow a message cursor to be used for more than one data item.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:30 -07:00
Alex Elder 36153ec9dd libceph: move cursor into message
A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.

Move the cursor embedded in the message data structure into the
message itself.  To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.

Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.

This and the upcoming patches are part of:
    http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:29 -07:00
Alex Elder c851c49591 libceph: record bio length
The bio is the only data item type that doesn't record its full
length.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:28 -07:00
Alex Elder ea96571f7b libceph: fix possible CONFIG_BLOCK build problem
This patch:
    15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:26 -07:00
Alex Elder 5476492fba libceph: kill off osd request r_data_in and r_data_out
Finally!  Convert the osd op data pointers into real structures, and
make the switch over to using them instead of having all ops share
the in and/or out data structures in the osd request.

Set up a new function to traverse the set of ops and release any
data associated with them (pages).

This and the patches leading up to it resolve:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:25 -07:00
Alex Elder ec9123c567 libceph: set the data pointers when encoding ops
Still using the osd request r_data_in and r_data_out pointer, but
we're basically only referring to it via the data pointers in the
osd ops.  And we're transferring that information to the request
or reply message only when the op indicates it's needed, in
osd_req_encode_op().

To avoid a forward reference, ceph_osdc_msg_data_set() was moved up
in the file.

Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(),
because the ops array will already be zeroed anyway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:24 -07:00
Alex Elder a4ce40a9a7 libceph: combine initializing and setting osd data
This ends up being a rather large patch but what it's doing is
somewhat straightforward.

Basically, this is replacing two calls with one.  The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters.  In place of those two will be a single function that
initializes the op directly.

That means we sort of fan out a set of the needed functions:
    - extent ops with pages data
    - extent ops with pagelist data
    - extent ops with bio list data
and
    - class ops with page data for receiving a response

We also have define another one, but it's only used internally:
    - class ops with pagelist data for request parameters

Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields.  All the osd ops refer to them for
their data.  For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:23 -07:00
Alex Elder 5f562df5f5 libceph: format class info at init time
An object class method is formatted using a pagelist which contains
the class name, the method name, and the data concatenated into an
osd request's outbound data.

Currently when a class op is initialized in osd_req_op_cls_init(),
the lengths of and pointers to these three items are recorded.
Later, when the op is getting formatted into the request message, a
new pagelist is created and that is when these items get copied into
the pagelist.

This patch makes it so the pagelist to hold these items is created
when the op is initialized instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:19 -07:00
Alex Elder c99d2d4abb libceph: specify osd op by index in request
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.

So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize.  This better hides the details the
op structure (and faciltates moving the data pointers they use).

Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope.  Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:15 -07:00
Alex Elder 8c042b0df9 libceph: add data pointers in osd op structures
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message.  Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.

Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message.  The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.

Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient.  Add some assertions to verify
pointers are always set the way they're expected to be.

This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.

This is the first in a series of patches that resolve:
    http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:14 -07:00
Alex Elder 54d5064912 libceph: rename data out field in osd request op
There are fields "indata" and "indata_len" defined the ceph osd
request op structure.  The "in" part is with from the point of view
of the osd server, but is a little confusing here on the client
side.  Change their names to use "request" instead of "in" to
indicate that it defines data provided with the request (as opposed
the data returned in the response).

Rename the local variable in osd_req_encode_op() to match.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:13 -07:00
Alex Elder 79528734f3 libceph: keep source rather than message osd op array
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.

In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.

As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops.  And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.

This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used.  The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2.  So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.

The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.

Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.

It does not yet add per-op data, that's coming soon.

This resolves:
    http://tracker.ceph.com/issues/4656

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:12 -07:00
Alex Elder 43bfe5de9f libceph: define osd data initialization helpers
Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:06 -07:00
Alex Elder e5975c7c8e ceph: build osd request message later for writepages
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.

We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write.  A local variable
will be used to keep track of the allocated array of pages.  Wait
until just before submitting the request for assigning that page
array pointer to the request message.

Create ands use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.

Previously, ceph_writepages_start() assigned the message header's
data length because of this update.  That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:18:02 -07:00