Commit Graph

489 Commits

Author SHA1 Message Date
Luis Henriques 78beb0ff2f ceph: use copy-from2 op in copy_file_range
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.

If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP.  In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li b38c9eb475 ceph: add possible_max_rank and make the code more readable
The m_num_mds here is actually the number for MDSs which are in
up:active status, and it will be duplicated to m_num_active_mds,
so remove it.

Add possible_max_rank to the mdsmap struct and this will be
the correctly possible largest rank boundary.

Remove the special case for one mds in __mdsmap_get_random_mds(),
because the validate mds rank may not always be 0.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li 4d7ace02ba ceph: fix mdsmap cluster available check based on laggy number
In case the max_mds > 1 in MDS cluster and there is no any standby
MDS and all the max_mds MDSs are in up:active state, if one of the
up:active MDSs is dead, the m->m_num_laggy in kclient will be 1.
Then the mount will fail without considering other healthy MDSs.

There manybe some MDSs still "in" the cluster but not in up:active
state, we will ignore them. Only when all the up:active MDSs in
the cluster are laggy will treat the cluster as not be available.

In case decreasing the max_mds, the cluster will not stop the extra
up:active MDSs immediately and there will be a latency. During it
the up:active MDS number will be larger than the max_mds, so later
the m_info memories will 100% be reallocated.

Here will pick out the up:active MDSs as the m_num_mds and allocate
the needed memories once.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
David Howells 82995cc6c5 libceph, rbd, ceph: convert to use the new mount API
Convert the ceph filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

[ Numerous string handling, leak and regression fixes; rbd conversion
  was particularly broken and had to be redone almost from scratch. ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-27 22:28:37 +01:00
Yan, Zheng 2cef0ba803 libceph: add function that clears osd client's abort_err
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16 12:06:23 +02:00
Yan, Zheng 120a75ea9f libceph: add function that reset client's entity addr
This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16 12:06:23 +02:00
Luis Henriques 5c498950f7 libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-08-22 10:47:41 +02:00
Linus Torvalds d9b9c89304 Lots of exciting things this time!
- support for rbd object-map and fast-diff features (myself).  This
   will speed up reads, discards and things like snap diffs on sparse
   images.
 
 - ceph.snap.btime vxattr to expose snapshot creation time (David
   Disseldorp).  This will be used to integrate with "Restore Previous
   Versions" feature added in Windows 7 for folks who reexport ceph
   through SMB.
 
 - security xattrs for ceph (Zheng Yan).  Only selinux is supported
   for now due to the limitations of ->dentry_init_security().
 
 - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
   Layton).  This is actually a single feature bit which was missing
   because of the filesystem pieces.  With this in, the kernel client
   will finally be reported as "luminous" by "ceph features" -- it is
   still being reported as "jewel" even though all required Luminous
   features were implemented in 4.13.
 
 - stop NULL-terminating ceph vxattrs (Jeff Layton).  The convention
   with xattrs is to not terminate and this was causing inconsistencies
   with ceph-fuse.
 
 - change filesystem time granularity from 1 us to 1 ns, again fixing
   an inconsistency with ceph-fuse (Luis Henriques).
 
 On top of this there are some additional dentry name handling and cap
 flushing fixes from Zheng.  Finally, Jeff is formally taking over for
 Zheng as the filesystem maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl0u+X8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi9byB/9fIxzoxtDMvixtJabuGSJRtlijDlWF
 GlO6yIWCXl/8v8easR2PCF75U/xv0+QFQmze8PVi8u4Xz589P247NnEuyEZ9n84i
 aCavARho6QLZPEL+B04NaqoHBl+ORKQTA6eKGhyKwRp/rn83z5Ubuw2tN7krHT3b
 kCY61FuTQGxNY2o/WKv/iLwINYr7H23hCf0WwyyKH1bp7OegiQ14Ebn1NtfS3sMx
 hS6h8Ya826vmUW0bCSS/9kzKYBCjksTig0HphUOHq6BoZJs++0b7GukIulRyuLfD
 J9Gr9HGPoDCVzdmFfpn2FSlxdmqfO9amUSagd0ftLQfFlPlrpoULi0GW
 =Bgxr
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Lots of exciting things this time!

   - support for rbd object-map and fast-diff features (myself). This
     will speed up reads, discards and things like snap diffs on sparse
     images.

   - ceph.snap.btime vxattr to expose snapshot creation time (David
     Disseldorp). This will be used to integrate with "Restore Previous
     Versions" feature added in Windows 7 for folks who reexport ceph
     through SMB.

   - security xattrs for ceph (Zheng Yan). Only selinux is supported for
     now due to the limitations of ->dentry_init_security().

   - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
     Layton). This is actually a single feature bit which was missing
     because of the filesystem pieces. With this in, the kernel client
     will finally be reported as "luminous" by "ceph features" -- it is
     still being reported as "jewel" even though all required Luminous
     features were implemented in 4.13.

   - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
     with xattrs is to not terminate and this was causing
     inconsistencies with ceph-fuse.

   - change filesystem time granularity from 1 us to 1 ns, again fixing
     an inconsistency with ceph-fuse (Luis Henriques).

  On top of this there are some additional dentry name handling and cap
  flushing fixes from Zheng. Finally, Jeff is formally taking over for
  Zheng as the filesystem maintainer"

* tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
  ceph: fix end offset in truncate_inode_pages_range call
  ceph: use generic_delete_inode() for ->drop_inode
  ceph: use ceph_evict_inode to cleanup inode's resource
  ceph: initialize superblock s_time_gran to 1
  MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
  rbd: setallochint only if object doesn't exist
  rbd: support for object-map and fast-diff
  rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
  libceph: export osd_req_op_data() macro
  libceph: change ceph_osdc_call() to take page vector for response
  libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
  rbd: new exclusive lock wait/wake code
  rbd: quiescing lock should wait for image requests
  rbd: lock should be quiesced on reacquire
  rbd: introduce copyup state machine
  rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
  rbd: move OSD request allocation into object request state machines
  rbd: factor out __rbd_osd_setup_discard_ops()
  rbd: factor out rbd_osd_setup_copyup()
  rbd: introduce obj_req->osd_reqs list
  ...
2019-07-18 11:05:25 -07:00
Ilya Dryomov 22e8bd51bb rbd: support for object-map and fast-diff
Speed up reads, discards and zeroouts through RBD_OBJ_FLAG_MAY_EXIST
and RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT based on object map.

Invalid object maps are not trusted, but still updated.  Note that we
never iterate, resize or invalidate object maps.  If object-map feature
is enabled but object map fails to load, we just fail the requester
(either "rbd map" or I/O, by way of post-acquire action).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:45 +02:00
Ilya Dryomov 4cf3e6dff7 libceph: export osd_req_op_data() macro
We already have one exported wrapper around it for extent.osd_data and
rbd_object_map_update_finish() needs another one for cls.request_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2019-07-08 14:01:45 +02:00
Ilya Dryomov 68ada915ee libceph: change ceph_osdc_call() to take page vector for response
This will be used for loading object map.  rbd_obj_read_sync() isn't
suitable because object map must be accessed through class methods.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2019-07-08 14:01:45 +02:00
Ilya Dryomov ef83171b49 libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
This time for rbd object map.  Object maps are limited in size to
256000000 objects, two bits per object.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2019-07-08 14:01:45 +02:00
Ilya Dryomov 94e8577188 libceph: rename r_unsafe_item to r_private_item
This list item remained from when we had safe and unsafe replies
(commit vs ack).  It has since become a private list item for use by
clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Yan, Zheng 49ada6e8dc ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP
Client uses this flag to tell mds if there is more cap snap need to
flush. It's mainly for the case that client needs to re-send cap/snap
flushes after mds failover, but CEPH_CAP_ANY_FILE_WR on corresponding
inodes are all released before mds failover.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton 6adaaafdd8 libceph: turn on CEPH_FEATURE_MSG_ADDR2
Now that the client can handle either address formatting, advertise to
the peer that we can support it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton 2c66de560f libceph: rename ceph_encode_addr to ceph_encode_banner_addr
...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton d3c3c0a841 libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE
Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton 0bfb0f2889 libceph: ADDR2 support for monmap
Switch the MonMap decoder to use the new decoding routine for
entity_addr_t's.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton 6c37f0e641 libceph: add ceph_decode_entity_addr
Add a function for decoding an entity_addr_t. Once
CEPH_FEATURE_MSG_ADDR2 is enabled, the server daemons will start
encoding entity_addr_t differently.

Add a new helper function that can handle either format.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Christoph Hellwig 97a385e558 libceph: remove ceph_get_direct_page_vector()
This function is entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:40 +02:00
Greg Kroah-Hartman 1a829ff2a6 ceph: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleanup allows the return value of the functions to be made void,
as no logic should care if these files succeed or not.

Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612145538.GA18772@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:18 +02:00
Jeff Layton b726ec972c libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer
GCC9 is throwing a lot of warnings about unaligned accesses by
callers of ceph_pr_addr. All of the current callers are passing a
pointer to the sockaddr inside struct ceph_entity_addr.

Fix it to take a pointer to a struct ceph_entity_addr instead,
and then have the function make a copy of the sockaddr before
printing it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07 19:43:05 +02:00
Arnd Bergmann 0384892c2d libceph: fix clang warning for CEPH_DEFINE_OID_ONSTACK
clang complains about assigning a variable to itself during the
declaration:

fs/ceph/ioctl.c:187:26: error: variable 'oid' is uninitialized when used within its own initialization [-Werror,-Wuninitialized]
        CEPH_DEFINE_OID_ONSTACK(oid);
                                ^~~
include/linux/ceph/osdmap.h:122:52: note: expanded from macro 'CEPH_DEFINE_OID_ONSTACK'
        struct ceph_object_id oid = CEPH_OID_INIT_ONSTACK(oid)
                              ~~~                         ^~~
include/linux/ceph/osdmap.h:120:29: note: expanded from macro 'CEPH_OID_INIT_ONSTACK'
    ({ ceph_oid_init(&oid); oid; })
                            ^~~

We use this trick in other places, but it is completely unnecessary
here, as we can just use a regular struct initializer.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07 19:22:37 +02:00
Yan, Zheng 570df4e9c2 ceph: snapshot nfs re-export
To support snapshot nfs re-export, we need a way to lookup snapped
inode by file handle. For directory inode, snapped metadata are always
stored together with head inode. Client just need to pass vinodeno_t
to MDS. For non-directory inode, there can be multiple version of
snapped inodes and they can be stored in different dirfrags. Besides
vinodeno_t, client also need to tell mds from which dirfrag it got the
snapped inode.

Another problem of supporting snapshot nfs re-export is that there
can be multiple paths to access a snapped inode. For example:

  mkdir -p d1/d2/d3
  mkdir d1/.snap/s1

Paths 'd1/.snap/s1/d2/d3', 'd1/d2/.snap/_s1_<inode number of d1>/d3'
and 'd1/d2/d3/.snap/_s1_<inode number of d1>' are all reference to the
same snapped inode. For a given snapped inode, There is no convenient
way to get the first form and the second form paths. For simplicity,
ceph_get_parent() return snapdir for snapped directory inode.

Furthermore, client may access snapshot of deleted directory. For
example:

  mkdir -p d1/d2
  mkdir d1/.snap/s1
  open d1/.snap/s1/d2
  rm -rf d1/d2
  <nfs server restart>

The path constucted by ceph_get_parent() and ceph_get_name() is
'<inode of d2>/.snap/_s1_<inode number of d1>'. Futher lookup parent
of <inode of d2> will fail. To workaround this case, this patch uses
d_obtain_root() to get dentry for snapdir of deleted directory.
snapdir dentry has no DCACHE_DISCONNECTED flag set, reconnect_path()
stops when it reaches snapdir dentry.

Link: http://tracker.ceph.com/issues/22105
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07 19:22:36 +02:00
Ilya Dryomov bb229bbb3b libceph: wait for latest osdmap in ceph_monc_blacklist_add()
Because map updates are distributed lazily, an OSD may not know about
the new blacklist for quite some time after "osd blacklist add" command
is completed.  This makes it possible for a blacklisted but still alive
client to overwrite a post-blacklist update, resulting in data
corruption.

Waiting for latest osdmap in ceph_monc_blacklist_add() and thus using
the post-blacklist epoch for all post-blacklist requests ensures that
all such requests "wait" for the blacklist to come into force on their
respective OSDs.

Cc: stable@vger.kernel.org
Fixes: 6305a3b415 ("libceph: support for blacklisting clients")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2019-03-20 16:27:40 +01:00
Yan, Zheng fe33032daa ceph: add mount option to limit caps count
If number of caps exceed the limit, ceph_trim_dentires() also trim
dentries with valid leases. Trimming dentry releases references to
associated inode, which may evict inode and release caps.

By default, there is no limit for caps count.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-03-05 18:55:17 +01:00
Dongsheng Yang 02b2f549d5 libceph: allow setting abort_on_full for rbd
Introduce a new option abort_on_full, default to false. Then
we can get -ENOSPC when the pool is full, or reaches quota.

[ Don't show abort_on_full in /proc/mounts. ]

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-07 22:47:48 +01:00
Ilya Dryomov 23c625ce30 libceph: assume argonaut on the server side
No one is running pre-argonaut.  In addition one of the argonaut
features (NOSRCADDR) has been required since day one (and a half,
2.6.34 vs 2.6.35) of the kernel client.

Allow for the possibility of reusing these feature bits later.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-11-08 17:51:11 +01:00
Luis Henriques 23ddf9bea9 libceph: support the RADOS copy-from operation
Add support for performing remote object copies using the 'copy-from'
operation.

[ Add COPY_FROM to get_num_data_items(). ]

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:23 +02:00
Ilya Dryomov 0d9c1ab3be libceph: preallocate message data items
Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request().  send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b7e ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased.  Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov 33165d4723 libceph: introduce ceph_pagelist_alloc()
struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount.  Merge allocation and initialization together.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov 24639ce560 libceph: osd_req_op_cls_init() doesn't need to take opcode
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Ilya Dryomov 94e6992bb5 libceph: bump CEPH_MSG_MAX_DATA_LEN
If the read is large enough, we end up spinning in the messenger:

  libceph: osd0 192.168.122.1:6801 io error
  libceph: osd0 192.168.122.1:6801 io error
  libceph: osd0 192.168.122.1:6801 io error

This is a receive side limit, so only reads were affected.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Ilya Dryomov cc255c76c7 libceph: implement CEPHX_V2 calculation mode
Derive the signature from the entire buffer (both AES cipher blocks)
instead of using just the first half of the first block, leaving out
data_crc entirely.

This addresses CVE-2018-1129.

Link: http://tracker.ceph.com/issues/24837
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:25 +02:00
Ilya Dryomov 6daca13d2e libceph: add authorizer challenge
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply).  This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge).  The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Ilya Dryomov 262614c429 libceph: store ceph_auth_handshake pointer in ceph_connection
We already copy authorizer_reply_buf and authorizer_reply_buf_len into
ceph_connection.  Factoring out __prepare_write_connect() requires two
more: authorizer_buf and authorizer_buf_len.  Store the pointer to the
handshake in con->auth rather than piling on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:22 +02:00
Ilya Dryomov f7e52d8efe libceph: remove now unused ceph_{en,de}code_timespec()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:20 +02:00
Arnd Bergmann fac02ddf91 libceph: use timespec64 for r_mtime
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

[ Remove now redundant ts variable in writepage_nounlock(). ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:14 +02:00
Arnd Bergmann 473bd2d780 libceph: use timespec64 in for keepalive2 and ticket validity
ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts().  Replacing this with timespec64
based interfaces  lets us remove that deprecated API.

I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.

The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Ilya Dryomov c9ed51c912 libceph: change ceph_pagelist_encode_string() to take u32
The wire format dictates that the length of string fits into 4 bytes.
Take u32 instead of size_t to reflect that.

We were already truncating len in ceph_pagelist_encode_32() -- this
just pushes that truncation one level up.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov 6d54228fd1 libceph: make ceph_osdc_notify{,_ack}() payload_len u32
The wire format dictates that payload_len fits into 4 bytes.  Take u32
instead of size_t to reflect that.

All callers pass a small integer, so no changes required.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov a86f009f10 libceph: allocate the locator string with GFP_NOFAIL
calc_target() isn't supposed to fail with anything but POOL_DNE, in
which case we report that the pool doesn't exist and fail the request
with -ENOENT.  Doing this for -ENOMEM is at the very least confusing
and also harmful -- as the preceding requests complete, a short-lived
locator string allocation is likely to succeed after a wait.

(We used to call ceph_object_locator_to_pg() for a pi lookup.  In
theory that could fail with -ENOENT, hence the "ret != -ENOENT" warning
being removed.)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:00 +02:00
Ilya Dryomov c843d13cae libceph: make abort_on_full a per-osdc setting
The intent behind making it a per-request setting was that it would be
set for writes, but not for reads.  As it is, the flag is set for all
fs/ceph requests except for pool perm check stat request (technically
a read).

ceph_osdc_abort_on_full() skips reads since the previous commit and
I don't see a use case for marking individual requests.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-06-04 20:46:00 +02:00
Ilya Dryomov 88bc1922c2 libceph: defer __complete_request() to a workqueue
In the common case, req->r_callback is called by handle_reply() on the
ceph-msgr worker thread without any locks.  If handle_reply() fails, it
is called with both osd->lock and osdc->lock.  In the map check case,
it is called with just osdc->lock but held for write.  Finally, if the
request is aborted because of -ENOSPC or by ceph_osdc_abort_requests(),
it is called directly on the submitter's thread, again with both locks.

req->r_callback on the submitter's thread is relatively new (introduced
in 4.12) and ripe for deadlocks -- e.g. writeback worker thread waiting
on itself:

  inode_wait_for_writeback+0x26/0x40
  evict+0xb5/0x1a0
  iput+0x1d2/0x220
  ceph_put_wrbuffer_cap_refs+0xe0/0x2c0 [ceph]
  writepages_finish+0x2d3/0x410 [ceph]
  __complete_request+0x26/0x60 [libceph]
  complete_request+0x2e/0x70 [libceph]
  __submit_request+0x256/0x330 [libceph]
  submit_request+0x2b/0x30 [libceph]
  ceph_osdc_start_request+0x25/0x40 [libceph]
  ceph_writepages_start+0xdfe/0x1320 [ceph]
  do_writepages+0x1f/0x70
  __writeback_single_inode+0x45/0x330
  writeback_sb_inodes+0x26a/0x600
  __writeback_inodes_wb+0x92/0xc0
  wb_writeback+0x274/0x330
  wb_workfn+0x2d5/0x3b0

Defer __complete_request() to a workqueue in all failure cases so it's
never on the same thread as ceph_osdc_start_request() and always called
with no locks held.

Link: http://tracker.ceph.com/issues/23978
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-06-04 20:45:58 +02:00
Ilya Dryomov 66850df585 libceph: introduce ceph_osdc_abort_requests()
This will be used by the filesystem for "umount -f".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:57 +02:00
Yan, Zheng 49a9f4f671 ceph: always get rstat from auth mds
rstat is not tracked by capability. client can't know if rstat from
non-auth mds is uptodate or not.

Link: http://tracker.ceph.com/issues/23538
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Chengguang Xu fe943d5042 libceph, rbd: add error handling for osd_req_op_cls_init()
Add proper error handling for osd_req_op_cls_init() to replace
BUG_ON statement when failing from memory allocation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:54 +02:00
Ilya Dryomov 0010f7052d libceph: add osd_req_op_extent_osd_data_bvecs()
... and store num_bvecs for client code's convenience.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-05-10 10:15:05 +02:00
Luis Henriques fb18a57568 ceph: quota: add initial infrastructure to support cephfs quotas
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client.  Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:51 +02:00
Chengguang Xu bb48bd4dc4 ceph: optimize memory usage
In current code, regular file and directory use same struct
ceph_file_info to store fs specific data so the struct has to
include some fields which are only used for directory
(e.g., readdir related info), when having plenty of regular files,
it will lead to memory waste.

This patch introduces dedicated ceph_dir_file_info cache for
readdir related thins. So that regular file does not include those
unused fields anymore.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:49 +02:00
Ilya Dryomov 08c1ac508b libceph, ceph: move ceph_calc_file_object_mapping() to striper.c
ceph_calc_file_object_mapping() has nothing to do with osdmaps.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:43 +02:00
Ilya Dryomov ed0811d2d2 libceph: striping framework implementation
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:42 +02:00
Ilya Dryomov b9e281c2b3 libceph: introduce BVECS data type
In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for
working with bio_vec array data buffers.  The wrappers are trivial, but
make it look similar to ceph_bio_iter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:39 +02:00
Ilya Dryomov 5359a17d27 libceph, rbd: new bio handling code (aka don't clone bios)
The reason we clone bios is to be able to give each object request
(and consequently each ceph_osd_data/ceph_msg_data item) its own
pointer to a (list of) bio(s).  The messenger then initializes its
cursor with cloned bio's ->bi_iter, so it knows where to start reading
from/writing to.  That's all the cloned bios are used for: to determine
each object request's starting position in the provided data buffer.

Introduce ceph_bio_iter to do exactly that -- store position within bio
list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:38 +02:00
Ilya Dryomov dccbf08005 libceph, ceph: change ceph_calc_file_object_mapping() signature
- make it void
- xlen (object extent length) out parameter should be u32 because only
  a single stripe unit is mapped at a time

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02 10:12:38 +02:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Douglas Fuller 06d74376c8 ceph: more accurate statfs
Improve accuracy of statfs reporting for Ceph filesystems comprising
exactly one data pool. In this case, the Ceph monitor can now report
the space usage for the single data pool instead of the global data
for the entire Ceph cluster. Include support for this message in
mon_client and leverage it in ceph/super.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:49 +02:00
Yan, Zheng 95569713af ceph: new cap message flags indicate if there is pending capsnap
These flags tell mds if there is pending capsnap explicitly.
Without this explicit notification, mds can only conclude if
client has pending capsnap. The method mds use is inefficient
and error-prone.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:44 +02:00
Yanhu Cao 3fb99d483e ceph: nuke startsync op
startsync is a no-op, has been for years.  Remove it.

Link: http://tracker.ceph.com/issues/20604
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:43 +02:00
Yan, Zheng 4214fb158c ceph: validate correctness of some mount options
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:42 +02:00
Yan, Zheng 2ae409dc6a ceph: remove unused cap_release_safety mount option
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:43:05 +02:00
Ilya Dryomov ae78dd8139 libceph: make RECOVERY_DELETES feature create a new interval
This is needed so that the OSDs can regenerate the missing set at the
start of a new interval where support for recovery deletes changed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-08-01 16:46:45 +02:00
Arnd Bergmann e67ae2b7b2 libceph: fix old style declaration warnings
The new macros don't follow the usual style for declarations,
which we get a warning for with 'make W=1':

In file included from fs/ceph/mds_client.c:16:0:
include/linux/ceph/ceph_features.h:74:1: error: 'static' is not at beginning of declaration [-Werror=old-style-declaration]

This moves the 'static' keyword to the front of the
declaration.

Fixes: f179d3ba8c ("libceph: new features macros")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:58 +02:00
Ilya Dryomov 33e9c8dbfb libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
All four SERVER_LUMINOUS feature bits are implemented, switch it on!

NEW_OSDOP_ENCODING doesn't mean much for the client (it signifies
support for MOSDOp v6) but needs to be enabled in order to get the
latest (currently v25) pg_pool_t.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Sage Weil <sage@redhat.com>
2017-07-07 17:26:24 +02:00
Ilya Dryomov 0bb05da2ec libceph: osd_state is 32 bits wide in luminous
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:19 +02:00
Ilya Dryomov 6f428df47d libceph: pg_upmap[_items] infrastructure
pg_temp and pg_upmap encodings are the same (PG -> array of osds),
except for the incremental remove: it's an empty mapping in new_pg_temp
for pg_temp and a separate old_pg_upmap set for pg_upmap.  (This isn't
to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
looked at as an example for pg_upmap encoding.)

Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
__decode_pg_temp() stores into pg_temp union member, but since pg_upmap
union member is identical, reading through pg_upmap later is OK.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:18 +02:00
Ilya Dryomov 278b1d709c libceph: ceph_decode_skip_* helpers
Some of these won't be as efficient as they could be (e.g.
ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
once instead of advancing by sizeof(u32) len times), but that's fine
and not worth a bunch of extra macro code.

Replace skip_name_map() with ceph_decode_skip_map as an example.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:18 +02:00
Ilya Dryomov a02a946dfe libceph: respect RADOS_BACKOFF backoffs
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov 76f827a7b1 libceph: make DEFINE_RB_* helpers more general
Initially for ceph_pg_mapping, ceph_spg_mapping and ceph_hobject_id,
compared with ceph_pg_compare(), ceph_spg_compare() and hoid_compare()
respectively.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov df28152d53 libceph: avoid unnecessary pi lookups in calc_target()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:17 +02:00
Ilya Dryomov 04c7d789e2 libceph: make sure need_resend targets reflect latest map
Otherwise we may miss events like PG splits, pool deletions, etc when
we get multiple incremental maps at once.  Because check_pool_dne() can
now be fed an unlinked request, finish_request() needed to be taught to
handle unlinked requests.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:16 +02:00
Ilya Dryomov 7de030d6b1 libceph: resend on PG splits if OSD has RESEND_ON_SPLIT
Note that ceph_osd_request_target fields are updated regardless of
RESEND_ON_SPLIT.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:16 +02:00
Ilya Dryomov 8cb441c054 libceph: MOSDOp v8 encoding (actual spgid + full hash)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov 98ad5ebd15 libceph: ceph_connection_operations::reencode_message() method
Give upper layers a chance to reencode the message after the connection
is negotiated and ->peer_features is set.  OSD client will use this to
support both luminous and pre-luminous OSDs (in a single cluster): the
former need MOSDOp v8; the latter will continue to be sent MOSDOp v4.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov dc98ff7230 libceph: introduce ceph_spg, ceph_pg_to_primary_shard()
Store both raw pgid and actual spgid in ceph_osd_request_target.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov dc93e0e283 libceph: fold [l]req->last_force_resend into ceph_osd_request_target
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov 220abf5aa7 libceph: support SERVER_JEWEL feature bits
Only MON_STATEFUL_SUB, really.  MON_ROUTE_OSDMAP and
OSDSUBOP_NO_SNAPCONTEXT are irrelevant.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:15 +02:00
Ilya Dryomov 2d7522e0bd libceph: advertise support for OSD_POOLRESEND
The code has been in place since commit 63244fa123 ("libceph:
introduce ceph_osd_request_target, calc_target()"), and, with the
ceph_{oloc,oid}_copy() issue fixed in the previous commit, is now
in working order.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov f179d3ba8c libceph: new features macros
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov dcbbd97ccb libceph: remove ceph_sanitize_features() workaround
Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Ilya Dryomov 6f4dbd149d libceph: use kbasename() and kill ceph_file_part()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2017-05-23 20:32:10 +02:00
Alexander Graf f775ff7d89 ceph: fix file open flags on ppc64
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.

Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That lead to bogus flags getting transmitted on ppc64.

This patch converts the kernel view of flags to the ceph view of file
open flags.

Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Ilya Dryomov 14bb211d32 rbd: support updating the lock cookie without releasing the lock
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 09:19:23 +02:00
Jeff Layton 58eb7932ae libceph: add an epoch_barrier field to struct ceph_osd_client
Cephfs can get cap update requests that contain a new epoch barrier in
them. When that happens we want to pause all OSD traffic until the right
map epoch arrives.

Add an epoch_barrier field to ceph_osd_client that is protected by the
osdc->lock rwsem. When the barrier is set, and the current OSD map
epoch is below that, pause the request target when submitting the
request or when revisiting it. Add a way for upper layers (cephfs)
to update the epoch_barrier as well.

If we get a new map, compare the new epoch against the barrier before
kicking requests and request another map if the map epoch is still lower
than the one we want.

If we get a map with a full pool, or at quota condition, then set the
barrier to the current epoch value.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton aa26d662b9 libceph: remove req->r_replay_version
Nothing uses this anymore with the removal of the ack vs. commit code.
Remove the field and just encode zeroes into place in the request
encoding.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng 79162547b7 ceph: make seeky readdir more efficient
Current cephfs client uses string to indicate start position of
readdir. The string is last entry of previous readdir reply.
This approach does not work for seeky readdir because we can
not easily convert the new postion to a string. For seeky readdir,
mds needs to return dentries from the beginning. Client keeps
retrying if the reply does not contain the dentry it wants.

In current version of ceph, mds sorts CDentry in its cache in
hash order. Client also uses dentry hash to compose dir postion.
For seeky readdir, if client passes the hash part of dir postion
to mds. mds can avoid replying useless dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng 76201b6354 ceph: allow connecting to mds whose rank >= mdsmap::m_max_mds
mdsmap::m_max_mds is the expected count of active mds. It's not the
max rank of active mds. User can decrease mdsmap::m_max_mds, but does
not stop mds whose rank >= mdsmap::m_max_mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Elena Reshetova 0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 02113a0f14 libceph: convert ceph_osd.o_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova 06dfa96399 libceph: convert ceph_snap_context.nref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov 74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov 7cc5e38f2f libceph: osd_request_timeout option
osd_request_timeout specifies how many seconds to wait for a response
from OSDs before returning -ETIMEDOUT from an OSD request.  0 (default)
means no limit.

osd_request_timeout is osdkeepalive-precise -- in-flight requests are
swept through every osdkeepalive seconds.  With ack vs commit behaviour
gone, abort_request() is really simple.

This is based on a patch from Artur Molchanov <artur.molchanov@synesis.ru>.

Tested-by: Artur Molchanov <artur.molchanov@synesis.ru>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-07 14:30:38 +01:00
Ilya Dryomov b18b9550e4 libceph: get rid of ack vs commit
- CEPH_OSD_FLAG_ACK shouldn't be set anymore, so assert on it
- remove support for handling ack replies (OSDs will send ack replies
  only if clients request them)
- drop the "do lingering callbacks under osd->lock" logic from
  handle_reply() -- lreq->lock is sufficient in all three cases

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-24 19:04:57 +01:00
Ilya Dryomov 6c696d8560 rbd: kill obj_request->object_name and rbd_segment_name_cache
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20 12:16:15 +01:00
Ilya Dryomov 083a51fbc5 libceph: bump CEPH_PG_MAX_SIZE to 32
... to accommodate potentially very wide EC pools.  This increases the
size of a typical rbd ceph_osd_request by ~12% (from 1040 to 1168 bytes),
but I'd rather go future proof here.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov 743efcffff crush: merge working data and scratch
Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.

Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Ilya Dryomov 66a0e2d579 crush: remove mutable part of CRUSH map
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.

Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:11 +01:00
Arnd Bergmann d24cdcd3e4 libceph: use BUG() instead of BUG_ON(1)
I ran into this compile warning, which is the result of BUG_ON(1)
not always leading to the compiler treating the code path as
unreachable:

    include/linux/ceph/osdmap.h: In function 'ceph_can_shift_osds':
    include/linux/ceph/osdmap.h:62:1: error: control reaches end of non-void function [-Werror=return-type]

Using BUG() here avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:06 +01:00
Linus Torvalds 59331c215d A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
   (myself).  Also fixed a deadlock caused by a bogus allocation on the
   writeback path and authorize reply verification.
 
 - a fix for long stalls during fsync (Jeff Layton).  The client now
   has a way to force the MDS log flush, leading to ~100x speedups in
   some synthetic tests.
 
 - a new [no]require_active_mds mount option (Zheng Yan).  On mount, we
   will now check whether any of the MDSes are available and bail rather
   than block if none are.  This check can be avoided by specifying the
   "no" option.
 
 - a couple of MDS cap handling fixes and a few assorted patches
   throughout.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
 4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
 7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
 8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
 lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
 KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
 =ypWx
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A varied set of changes:

   - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
     (myself). Also fixed a deadlock caused by a bogus allocation on the
     writeback path and authorize reply verification.

   - a fix for long stalls during fsync (Jeff Layton). The client now
     has a way to force the MDS log flush, leading to ~100x speedups in
     some synthetic tests.

   - a new [no]require_active_mds mount option (Zheng Yan).

     On mount, we will now check whether any of the MDSes are available
     and bail rather than block if none are. This check can be avoided
     by specifying the "no" option.

   - a couple of MDS cap handling fixes and a few assorted patches
     throughout"

* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
  libceph: remove now unused finish_request() wrapper
  libceph: always signal completion when done
  ceph: avoid creating orphan object when checking pool permission
  ceph: properly set issue_seq for cap release
  ceph: add flags parameter to send_cap_msg
  ceph: update cap message struct version to 10
  ceph: define new argument structure for send_cap_msg
  ceph: move xattr initialzation before the encoding past the ceph_mds_caps
  ceph: fix minor typo in unsafe_request_wait
  ceph: record truncate size/seq for snap data writeback
  ceph: check availability of mds cluster on mount
  ceph: fix splice read for no Fc capability case
  ceph: try getting buffer capability for readahead/fadvise
  ceph: fix scheduler warning due to nested blocking
  ceph: fix printing wrong return variable in ceph_direct_read_write()
  crush: include mapper.h in mapper.c
  rbd: silence bogus -Wmaybe-uninitialized warning
  libceph: no need to drop con->mutex for ->get_authorizer()
  libceph: drop len argument of *verify_authorizer_reply()
  libceph: verify authorize reply on connect
  ...
2016-12-16 11:23:34 -08:00