OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Ilya Dryomov	038b8d1d1a	libceph: optionally use bounce buffer on recv path in crc mode Both msgr1 and msgr2 in crc mode are zero copy in the sense that message data is read from the socket directly into the destination buffer. We assume that the destination buffer is stable (i.e. remains unchanged while it is being read to) though. Otherwise, CRC errors ensue: libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706 libceph: osd1 (1)192.168.122.1:6843 bad crc/signature libceph: bad data crc, calculated 57958023, expected 1805382778 libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc Introduce rxbounce option to enable use of a bounce buffer when receiving message data. In particular this is needed if a mapped image is a Windows VM disk, passed to QEMU. Windows has a system-wide "dummy" page that may be mapped into the destination buffer (potentially more than once into the same buffer) by the Windows Memory Manager in an effort to generate a single large I/O [1][2]. QEMU makes a point of preserving overlap relationships when cloning I/O vectors, so krbd gets exposed to this behaviour. [1] "What Is Really in That MDL?" https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85) [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/ URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-02-02 18:50:36 +01:00
Ilya Dryomov	2ea8871636	libceph: make recv path in secure mode work the same as send path The recv path of secure mode is intertwined with that of crc mode. While it's slightly more efficient that way (the ciphertext is read into the destination buffer and decrypted in place, thus avoiding two potentially heavy memory allocations for the bounce buffer and the corresponding sg array), it isn't really amenable to changes. Sacrifice that edge and align with the send path which always uses a full-sized bounce buffer (currently there is no other way -- if the kernel crypto API ever grows support for streaming (piecewise) en/decryption for GCM [1], we would be able to easily take advantage of that on both sides). [1] https://lore.kernel.org/all/20141225202830.GA18794@gondor.apana.org.au/ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-02-02 18:50:36 +01:00
Linus Torvalds	64f29d8856	The highlight is the new mount "device" string syntax implemented by Venky Shankar. It solves some long-standing issues with using different auth entities and/or mounting different CephFS filesystems from the same cluster, remounting and also misleading /proc/mounts contents. The existing syntax of course remains to be maintained. On top of that, there is a couple of fixes for edge cases in quota and a new mount option for turning on unbuffered I/O mode globally instead of on a per-file basis with ioctl(CEPH_IOC_SYNCIO). -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmHpP5ATHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi0TgB/480i2lPHgA3ujJNqo5Q6z+W0vtTA2+ Wx+4rAUgIESJVunbFxvecPbzyUXTe7wWFI11TCVHPpf6GyIIDTD+uHd3kKWtLsfL Zkk1/2PN9Q5Dh29R+N8rP9NaP8tIaTQjyiO3iqmRZlo+k0Z/lYtWUb+fUP05XlVY ML/ktW543tkKeYwl3SWdW5MqAAOVGDbTt+L51CraDhVoiUac5ptkP+cmDmIqsnGa ZHVqpwugxgndEIyuBHDLBps+5/LrEaL10xDhGcMtP9hwGYhyNr6Yj+azfGtHWwOi jdVsdHDiecUBVtGyZ351Y4pCMOmP0uJif6MOUZFXYYSSeUBUhH8UjgEi =jcte -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "The highlight is the new mount "device" string syntax implemented by Venky Shankar. It solves some long-standing issues with using different auth entities and/or mounting different CephFS filesystems from the same cluster, remounting and also misleading /proc/mounts contents. The existing syntax of course remains to be maintained. On top of that, there is a couple of fixes for edge cases in quota and a new mount option for turning on unbuffered I/O mode globally instead of on a per-file basis with ioctl(CEPH_IOC_SYNCIO)" * tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-client: ceph: move CEPH_SUPER_MAGIC definition to magic.h ceph: remove redundant Lsx caps check ceph: add new "nopagecache" option ceph: don't check for quotas on MDS stray dirs ceph: drop send metrics debug message rbd: make const pointer spaces a static const array ceph: Fix incorrect statfs report for small quota ceph: mount syntax module parameter doc: document new CephFS mount device syntax ceph: record updated mon_addr on remount ceph: new device mount syntax libceph: rename parse_fsid() to ceph_parse_fsid() and export libceph: generalize addr/ip parsing based on delimiter	2022-01-20 13:46:20 +02:00
Michal Hocko	a421ef3030	mm: allow !GFP_KERNEL allocations for kvmalloc Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by previous patches so we can allow the support for kvmalloc. This will allow some external users to simplify or completely remove their helpers. GFP_NOWAIT semantic hasn't been supported so far but it hasn't been explicitly documented so let's add a note about that. ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc. Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-15 16:30:29 +02:00
Venky Shankar	4153c7fc93	libceph: rename parse_fsid() to ceph_parse_fsid() and export ... as it is too generic. also, use __func__ when logging rather than hardcoding the function name. Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2022-01-13 13:40:06 +01:00
Venky Shankar	2d7c86a8f9	libceph: generalize addr/ip parsing based on delimiter ... and remove hardcoded function name in ceph_parse_ips(). [ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ] Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2022-01-13 13:40:05 +01:00
Luís Henriques	aca39d9e86	libceph, ceph: move ceph_osdc_copy_from() into cephfs code This patch moves ceph_osdc_copy_from() function out of libceph code into cephfs. There are no other users for this function, and there is the need (in another patch) to access internal ceph_osd_request struct members. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-11-08 03:29:52 +01:00
Kotresh HR	e1c9788cb3	ceph: don't rely on error_string to validate blocklisted session. The "error_string" in the metadata of MClientSession is being parsed by kclient to validate whether the session is blocklisted. The "error_string" is for humans and shouldn't be relied on it. Hence added the flag to MClientsession to indicate the session is blocklisted. [ jlayton: minor formatting cleanup ] URL: https://tracker.ceph.com/issues/47450 Signed-off-by: Kotresh HR <khiremat@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-11-08 03:29:51 +01:00
Xiubo Li	d095559ce4	ceph: flush mdlog before umounting Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-09-02 22:49:16 +02:00
Ilya Dryomov	03af4c7bad	libceph: set global_id as soon as we get an auth ticket Commit `61ca49a910` ("libceph: don't set global_id until we get an auth ticket") delayed the setting of global_id too much. It is set only after all tickets are received, but in pre-nautilus clusters an auth ticket and the service tickets are obtained in separate steps (for a total of three MAuth replies). When the service tickets are requested, global_id is used to build an authorizer; if global_id is still 0 we never get them and fail to establish the session. Moving the setting of global_id into protocol implementations. This way global_id can be set exactly when an auth ticket is received, not sooner nor later. Fixes: `61ca49a910` ("libceph: don't set global_id until we get an auth ticket") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2021-06-24 21:03:17 +02:00
Ilya Dryomov	3c0d089432	libceph: don't pass result into ac->ops->handle_reply() There is no result to pass in msgr2 case because authentication failures are reported through auth_bad_method frame and in MAuth case an error is returned immediately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2021-06-24 21:03:16 +02:00
Ilya Dryomov	afd56e78dd	libceph: deprecate [no]cephx_require_signatures options These options were introduced in 3.19 with support for message signing and are rather useless, as explained in commit `a51983e4dd` ("libceph: add nocephx_sign_messages option"). Deprecate them. In case there is someone out there with a cluster that lacks support for MSG_AUTH feature (very unlikely but has to be considered since we haven't formally raised the bar from argonaut to bobtail yet), make nocephx_sign_messages also waive MSG_AUTH requirement. This is probably how it should have been done in the first place -- if we aren't going to sign, requiring the signing feature makes no sense. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2021-02-16 12:09:52 +01:00
Ilya Dryomov	664f1e259a	libceph: add __maybe_unused to DEFINE_MSGR2_FEATURE Avoid -Wunused-const-variable warnings for "make W=1". Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-28 20:34:33 +01:00
Ilya Dryomov	2f0df6cfa3	libceph: drop ceph_auth_{create,update}_authorizer() Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	cd1a677cad	libceph, ceph: implement msgr2.1 protocol (crc and secure modes) Implement msgr2.1 wire protocol, available since nautilus 14.2.11 and octopus 15.2.5. msgr2.0 wire protocol is not implemented -- it has several security, integrity and robustness issues and therefore considered deprecated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	00498b9941	libceph: introduce connection modes and ms_mode option msgr2 supports two connection modes: crc (plain) and secure (on-wire encryption). Connection mode is picked by server based on input from client. Introduce ms_mode option: ms_mode=legacy - msgr1 (default) ms_mode=crc - crc mode, if denied fail ms_mode=secure - secure mode, if denied fail ms_mode=prefer-crc - crc mode, if denied agree to secure mode ms_mode=prefer-secure - secure mode, if denied agree to crc mode ms_mode affects all connections, we don't separate connections to mons like it's done in userspace with ms_client_mode vs ms_mon_client_mode. For now the default is legacy, to be flipped to prefer-crc after some time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	313771e80f	libceph, rbd: ignore addr->type while comparing in some cases For libceph, this ensures that libceph instance sharing (share option) continues to work. For rbd, this avoids blocklisting alive lock owners (locker addr is always LEGACY, while watcher addr is ANY in nautilus). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	a5cbd5fc22	libceph, ceph: get and handle cluster maps with addrvecs In preparation for msgr2, make the cluster send us maps with addrvecs including both LEGACY and MSGR2 addrs instead of a single LEGACY addr. This means advertising support for SERVER_NAUTILUS and also some older features: SERVER_MIMIC, MONENC and MONNAMES. MONNAMES and MONENC are actually pre-argonaut, we just never updated ceph_monmap_decode() for them. Decoding is unconditional, see commit `23c625ce30` ("libceph: assume argonaut on the server side"). SERVER_MIMIC doesn't bear any meaning for the kernel client. Since ceph_decode_entity_addrvec() is guarded by encoding version checks (and in msgr2 case it is guarded implicitly by the fact that server is speaking msgr2), we assume MSG_ADDR2 for it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	c1c0ce78f4	libceph: drop ac->ops->name field Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	59711f9ec2	libceph: amend cephx init_protocol() and build_request() In msgr2, initial authentication happens with an exchange of msgr2 control frames -- MAuth message and struct ceph_mon_request_header aren't used. Make that optional. Stop reporting cephx protocol as "x". Use "cephx" instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	285ea34fc8	libceph, ceph: incorporate nautilus cephx changes - request service tickets together with auth ticket. Currently we get auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message. Since nautilus, desired service tickets are shared togther with auth ticket in CEPHX_GET_AUTH_SESSION_KEY reply. - propagate session key and connection secret, if any. In preparation for msgr2, update handle_reply() and verify_authorizer_reply() auth ops to propagate session key and connection secret. Since nautilus, if secure mode is negotiated, connection secret is shared either in CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer reply (for osds and mdses). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	a56dd9bf47	libceph: move msgr1 protocol specific fields to its own struct A couple whitespace fixups, no functional changes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	2f713615dd	libceph: move msgr1 protocol implementation to its own file A pure move, no other changes. Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}() helpers are also moved. msgr2 will bring its own, more efficient, variants based on iov_iter. Switching msgr1 to them was considered but decided against to avoid subtle regressions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	566050e17e	libceph: separate msgr1 protocol implementation In preparation for msgr2, define internal messenger <-> protocol interface (as opposed to external messenger <-> client interface, which is struct ceph_connection_operations) consisting of try_read(), try_write(), revoke(), revoke_incoming(), opened(), reset_session() and reset_protocol() ops. The semantics are exactly the same as they are now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	6503e0b69c	libceph: export remaining protocol independent infrastructure In preparation for msgr2, make all protocol independent functions in messenger.c global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	699921d9e6	libceph: export zero_page In preparation for msgr2, make zero_page global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	3fefd43e74	libceph: rename and export con->flags bits In preparation for msgr2, move the defines to the header file. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	6d7f62bfb5	libceph: rename and export con->state states In preparation for msgr2, rename msgr1 specific states and move the defines to the header file. Also drop state transition comments. They don't cover all possible transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more harm than good. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	30be780a87	libceph: make con->state an int unsigned long is a leftover from when con->state used to be a set of bits managed with set_bit(), clear_bit(), etc. Save a bit of memory. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	5cd8da3a1c	libceph: drop msg->ack_stamp field It is set in process_ack() but never used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	418af5b3bf	libceph: lower exponential backoff delay The current setting allows the backoff to climb up to 5 minutes. This is too high -- it becomes hard to tell whether the client is stuck on something or just in backoff. In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the same. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Jeff Layton	4f1ddb1ea8	ceph: implement updated ceph_mds_request_head structure When we added the btime feature in mainline ceph, we had to extend struct ceph_mds_request_args so that it could be set. Implement the same in the kernel client. Rename ceph_mds_request_head with a _old extension, and a union ceph_mds_request_args_ext to allow for the extended size of the new header format. Add the appropriate code to handle both formats in struct create_request_message and key the behavior on whether the peer supports CEPH_FEATURE_FS_BTIME. The gid_list field in the payload is now populated from the saved credential. For now, we don't add any support for setting the btime via setattr, but this does enable us to add that in the future. [ idryomov: break unnecessarily long lines ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Xiubo Li	968cd14edc	ceph: set osdmap epoch for setxattr When setting the file/dir layout, it may need data pool info. So in mds server, it needs to check the osdmap. At present, if mds doesn't find the data pool specified, it will try to get the latest osdmap. Now if pass the osd epoch for setxattr, the mds server can only check this epoch of osdmap. URL: https://tracker.ceph.com/issues/48504 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Liu, Changcheng	36c9478d60	libceph: remove unused port macros 1. monitor's default port is defined by CEPH_MON_PORT 2. CEPH_PORT_START and CEPH_PORT_LAST are not needed. Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	50c9132ddf	ceph: add new RECOVER mount_state when recovering session When recovering a session (a'la recover_session=clean), we want to do all of the operations that we do on a forced umount, but changing the mount state to SHUTDOWN is can cause queued MDS requests to fail when the session comes back. Most of those can idle until the session is recovered in this situation. Reserve SHUTDOWN state for forced umount, and make a new RECOVER state for the forced reconnect situation. Change several tests for equality with SHUTDOWN to test for that or RECOVER. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:46 +01:00
Ilya Dryomov	b07720d0bd	libceph: fix ENTITY_NAME format suggestion Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:27 +02:00
Ilya Dryomov	0b98acd618	libceph, rbd, ceph: "blacklist" -> "blocklist" Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:26 +02:00
Ilya Dryomov	3986f9a42e	libceph: multiple workspaces for CRUSH computations Replace a global map->crush_workspace (protected by a global mutex) with a list of workspaces, up to the number of CPUs + 1. This is based on a patch from Robin Geuze <robing@nl.team.blue>. Robin and his team have observed a 10-20% increase in IOPS on all queue depths and lower CPU usage as well on a high-end all-NVMe 100GbE cluster. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-10-12 15:29:26 +02:00
Ilya Dryomov	f062f025fc	libceph: add __maybe_unused to DEFINE_CEPH_FEATURE Avoid -Wunused-const-variable warnings for "make W=1". Reported-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>	2020-08-24 10:33:08 +02:00
Jeff Layton	a0102bda5b	ceph: move sb->wb_pagevec_pool to be a global mempool When doing some testing recently, I hit some page allocation failures on mount, when creating the wb_pagevec_pool for the mount. That requires 128k (32 contiguous pages), and after thrashing the memory during an xfstests run, sometimes that would fail. 128k for each mount seems like a lot to hold in reserve for a rainy day, so let's change this to a global mempool that gets allocated when the module is plugged in. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-08-04 19:41:12 +02:00
Randy Dunlap	f1f565a269	ceph: delete repeated words in fs/ceph/ Drop duplicated words "down" and "the" in fs/ceph/. [ idryomov: merge into a single patch ] Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-08-03 11:05:27 +02:00
Xiubo Li	18f473b384	ceph: periodically send perf metrics to MDSes This will send the caps/read/write/metadata metrics to any available MDS once per second, which will be the same as the userland client. It will skip the MDS sessions which don't support the metric collection, as the MDSs will close socket connections when they get an unknown type message. We can disable the metric sending via the disable_send_metrics module parameter. [ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ] URL: https://tracker.ceph.com/issues/43215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-08-03 11:05:26 +02:00
Jeff Layton	042f649810	libceph: just have osd_req_op_init() return a pointer The caller can just ignore the return. No need for this wrapper that just casts the other function to void. [ idryomov: argument alignment ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-08-03 11:05:25 +02:00
Ilya Dryomov	22d2cfdffa	libceph: move away from global osd_req_flags osd_req_flags is overly general and doesn't suit its only user (read_from_replica option) well: - applying osd_req_flags in account_request() affects all OSD requests, including linger (i.e. watch and notify). However, linger requests should always go to the primary even though some of them are reads (e.g. notify has side effects but it is a read because it doesn't result in mutation on the OSDs). - calls to class methods that are reads are allowed to go to the replica, but most such calls issued for "rbd map" and/or exclusive lock transitions are requested to be resent to the primary via EAGAIN, doubling the latency. Get rid of global osd_req_flags and set read_from_replica flag only on specific OSD requests instead. Fixes: `8ad44d5e0d` ("libceph: read_from_replica option") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-06-16 16:01:53 +02:00
Ilya Dryomov	d3798acc09	libceph: support for alloc hint flags Allow indicating future I/O pattern via flags. This is supported since Kraken (and bluestore persists flags together with expected_object_size and expected_write_size). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>	2020-06-01 23:32:35 +02:00
Ilya Dryomov	8ad44d5e0d	libceph: read_from_replica option Expose replica reads through read_from_replica=balance and read_from_replica=localize. The default is to read from primary (read_from_replica=no). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-06-01 13:22:53 +02:00
Ilya Dryomov	117d96a04f	libceph: support for balanced and localized reads OSD-side issues with reads from replica have been resolved in Octopus. Reading from replica should be safe wrt. unstable or uncommitted state now, so add support for balanced and localized reads. There are two cases when a read from replica can't be served: - OSD may silently drop the request, expecting the client to notice that the acting set has changed and resend via the usual means (handled with t->used_replica) - OSD may return EAGAIN, expecting the client to resend to the primary, ignoring replica read flags (see handle_reply()) Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-06-01 13:22:53 +02:00
Ilya Dryomov	45e6aa9f55	libceph: crush_location infrastructure Allow expressing client's location in terms of CRUSH hierarchy as a set of (bucket type name, bucket name) pairs. The userspace syntax "crush_location = key1=value1 key2=value2" is incompatible with mount options and needed adaptation. Key-value pairs are separated by '\|' and we use ':' instead of '=' to separate keys from values. So for: crush_location = host=foo rack=bar one would write: crush_location=host:foo\|rack:bar As in userspace, "multipath" locations are supported, so indicating locality for parallel hierarchies is possible: crush_location=rack:foo1\|rack:foo2\|datacenter:bar Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-06-01 13:22:53 +02:00
Ilya Dryomov	8a4b863c87	libceph: add non-asserting rbtree insertion helper Needed for the next commit and useful for ceph_pg_pool_info tree as well. I'm leaving the asserting helper in for now, but we should look at getting rid of it in the future. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2020-06-01 13:22:53 +02:00
Gustavo A. R. Silva	53ab8e7cd2	libceph, rbd: replace zero-length array with flexible-array The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-06-01 13:22:53 +02:00

1 2 3 4 5 ...

502 Commits