OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	f41def3971	The highlights are: - automatic recovery of a blacklisted filesystem session (Zheng Yan). This is disabled by default and can be enabled by mounting with the new "recover_session=clean" option. - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is taken to avoid serializing O_DIRECT reads and writes with each other, this is based on the exclusion scheme from NFS. - handle large osdmaps better in the face of fragmented memory (myself) - don't limit what security.* xattrs can be get or set (Jeff Layton). We were overly restrictive here, unnecessarily preventing things like file capability sets stored in security.capability from working. - allow copy_file_range() within the same inode and across different filesystems within the same cluster (Luis Henriques) -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl2LoD8THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzixRYB/9H5S4fif8Pn9eiXkIJiT9UZR/o7S1k ikfQNPeDxlBLKnoZXpDp2HqCu1/YuCcJ0zpZzPGGrKECZb7r+NaayxhmEXAZ+Vsg YwsO3eNHBbb58pe9T4oiHp19sflwcTOeNwg8wlvmvrgfBupFz2pU8Xm72EdFyoYm tP0QNTOCAuQK3pJcgozaptAO1TzBL3LomyVM0YzAKcumgMg47zALpaSLWJLGtDLM 5+5WLvcVfBGLVv60h4B62ldS39eBxqTsFodcRMUaqAsnhLK70HVfKlwR3GgtZggr PDqbsuIfw/O3b65U2XDKZt1P9dyG3OE/ucueduXUxJPYNGmooEE+PpE+ =DRVP -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "The highlights are: - automatic recovery of a blacklisted filesystem session (Zheng Yan). This is disabled by default and can be enabled by mounting with the new "recover_session=clean" option. - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is taken to avoid serializing O_DIRECT reads and writes with each other, this is based on the exclusion scheme from NFS. - handle large osdmaps better in the face of fragmented memory (myself) - don't limit what security.* xattrs can be get or set (Jeff Layton). We were overly restrictive here, unnecessarily preventing things like file capability sets stored in security.capability from working. - allow copy_file_range() within the same inode and across different filesystems within the same cluster (Luis Henriques)" * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits) ceph: call ceph_mdsc_destroy from destroy_fs_client libceph: use ceph_kvmalloc() for osdmap arrays libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc() ceph: allow object copies across different filesystems in the same cluster ceph: include ceph_debug.h in cache.c ceph: move static keyword to the front of declarations rbd: pull rbd_img_request_create() dout out into the callers ceph: reconnect connection if session hang in opening state libceph: drop unused con parameter of calc_target() ceph: use release_pages() directly rbd: fix response length parameter for encoded strings ceph: allow arbitrary security.* xattrs ceph: only set CEPH_I_SEC_INITED if we got a MAC label ceph: turn ceph_security_invalidate_secctx into static inline ceph: add buffered/direct exclusionary locking for reads and writes libceph: handle OSD op ceph_pagelist_append() errors ceph: don't return a value from void function ceph: don't freeze during write page faults ceph: update the mtime when truncating up ceph: fix indentation in __get_snap_name() ...	2019-09-25 10:21:13 -07:00
Jeff Layton	3ee5a7015c	ceph: call ceph_mdsc_destroy from destroy_fs_client They're always called in succession. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Luis Henriques	6fd4e63483	ceph: allow object copies across different filesystems in the same cluster OSDs are able to perform object copies across different pools. Thus, there's no need to prevent copy_file_range from doing remote copies if the source and destination superblocks are different. Only return -EXDEV if they have different fsid (the cluster ID). Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Ilya Dryomov	48f930ea6d	ceph: include ceph_debug.h in cache.c Any file that uses dout() should include ceph_debug.h at the top. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Krzysztof Wilczynski	536cc331a4	ceph: move static keyword to the front of declarations Move the static keyword to the front of declarations of snap_handle_length, handle_length and connected_handle_length, and resolve the following compiler warnings that can be seen when building with warnings enabled (W=1): fs/ceph/export.c:38:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/ceph/export.c:88:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/ceph/export.c:90:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Erqi Chen	71a228bc8d	ceph: reconnect connection if session hang in opening state If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, mds won't send session msg to client, and delayed_work skip CEPH_MDS_SESSION_OPENING state session, the session hang forever. Allow ceph_con_keepalive to reconnect a session in OPENING to avoid session hang. Also, ensure that we skip sessions in RESTARTING and REJECTED states since those states can't be resurrected by issuing a keepalive. Link: https://tracker.ceph.com/issues/41551 Signed-off-by: Erqi Chen chenerqi@gmail.com Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
John Hubbard	96ac9158a2	ceph: use release_pages() directly release_pages() has been available to modules since Oct, 2010, when commit `0be8557bcd` ("fuse: use release_pages()") added EXPORT_SYMBOL(release_pages). However, this ceph code was still using a workaround. Remove the workaround, and call release_pages() directly. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	b8fe918b09	ceph: allow arbitrary security.* xattrs Most filesystems don't limit what security.* xattrs can be set or fetched. I see no reason that we need to limit that on cephfs either. Drop the special xattr handler for "security." xattrs, and allow the "other" xattr handler to handle security xattrs as well. In addition to fixing xfstest generic/093, this allows us to support per-file capabilities (a'la setcap(8)). Link: https://tracker.ceph.com/issues/41135 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	026105ebb0	ceph: only set CEPH_I_SEC_INITED if we got a MAC label __ceph_getxattr will set the CEPH_I_SEC_INITED flag whenever it gets any xattr that starts with "security.". We only want to set that flag when fetching the MAC label for the currently-active LSM, however. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	668959a535	ceph: turn ceph_security_invalidate_secctx into static inline No need to do an extra jump here. Also add some comments on the endifs. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	321fe13c93	ceph: add buffered/direct exclusionary locking for reads and writes xfstest generic/451 intermittently fails. The test does O_DIRECT writes to a file, and then reads back the result using buffered I/O, while running a separate set of tasks that are also doing buffered reads. The client will invalidate the cache prior to a direct write, but it's easy for one of the other readers' replies to race in and reinstantiate the invalidated range with stale data. To fix this, we must to serialize direct I/O writes and buffered reads. We could just sprinkle in some shared locks on the i_rwsem for reads, and increase the exclusive footprint on the write side, but that would cause O_DIRECT writes to end up serialized vs. other direct requests. Instead, borrow the scheme used by nfs.ko. Buffered writes take the i_rwsem exclusively, but buffered reads take a shared lock, allowing them to run in parallel. O_DIRECT requests also take a shared lock, but we need for them to not run in parallel with buffered reads. A flag on the ceph_inode_info is used to indicate whether it's in direct or buffered I/O mode. When a conflicting request is submitted, it will block until the inode can be flipped to the necessary mode. Link: https://tracker.ceph.com/issues/40985 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
John Hubbard	3e8730fac9	ceph: don't return a value from void function This fixes a build warning to that effect. Fixes: `1a829ff2a6` ("ceph: no need to check return value of debugfs_create functions") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	249c1df59a	ceph: don't freeze during write page faults Prevent freezing operations during write page faults. This is good practice for most filesystems, but especially for ceph since we're monkeying with the signal table here. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:25 +02:00
Jeff Layton	c62498d7f9	ceph: update the mtime when truncating up If we have Fx caps, and the we're truncating the size to be larger, then we'll cache the size attribute change, but the mtime won't be updated. Move the size handling before the mtime, and add ATTR_MTIME to ia_valid in that case to make sure the mtime also gets updated. This fixes xfstest generic/313. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Ilya Dryomov	0ed26f3693	ceph: fix indentation in __get_snap_name() Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	98cd281a76	ceph: remove incorrect comment above __send_cap It doesn't do anything to invalidate the cache when dropping RD caps. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	daca8bda95	ceph: remove CEPH_I_NOFLUSH Nothing sets this flag. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	27b0a39209	ceph: remove unneeded test in try_flush_caps cap->session is always non-NULL, so we can just do a single test for equality w/o testing explicitly for a NULL pointer. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	9f3345d8ec	ceph: have __mark_caps_flushing return flush_tid Currently, this function returns ci->i_dirty_caps, but the callers have to check that that isn't 0 before calling this function. Have the callers grab that value directly out of the inode, and have __mark_caps_flushing return the flush_tid instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	354c63a003	ceph: fix comments over ceph_add_cap We actually need the ci->i_ceph_lock here. The necessity of the s_mutex is less clear. Also add a lockdep assertion for the i_ceph_lock. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	533a2818dd	ceph: eliminate session->s_trim_caps It's only used to keep count of caps being trimmed, but that requires that we hold the session->s_mutex to prevent multiple trimming operations from running concurrently. We can achieve the same effect using an integer on the stack, which allows us to (eventually) not need the s_mutex. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	606d102327	ceph: fetch cap_gen under spinlock in ceph_add_cap It's protected by the s_gen_ttl_lock, so we should fetch under it and ensure that we're using the same generation in both places. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	5de16b30d3	ceph: remove ceph_get_cap_mds and __ceph_get_cap_mds Nothing calls these routines. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Jeff Layton	b72b13eb20	ceph: don't SetPageError on writepage errors We already mark the mapping in that case, and doing this can cause false positives to occur at fsync time, as well as spurious read errors. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	131d7eb4fa	ceph: auto reconnect after blacklisted Make client use osd reply and session message to infer if itself is blacklisted. Client reconnect to cluster using new entity addr if it is blacklisted. Auto reconnect is limited to once every 30 minutes. Auto reconnect is disabled by default. It can be enabled/disabled by recover_session=<no\|clean> mount option. In 'clean' mode, client drops any dirty data/metadata, invalidates page caches and invalidates all writable file handles. After reconnect, file locks become stale because MDS loses track of them. If an inode contains any stale file locks, read/write on the indoe are not allowed until applications release all stale file locks. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	81f148a910	ceph: invalidate all write mode filp after reconnect Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	ff5d913dfc	ceph: return -EIO if read/write against filp that lost file locks After mds evicts session, file locks get lost sliently. It's not safe to let programs continue to do read/write. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	d468e729b7	ceph: add helper function that forcibly reconnects to ceph cluster. It closes mds sessions, drop all caps and invalidates page caches, then use new entity address to reconnect to the cluster. After reconnect, all dirty data/metadata are dropped, file locks get lost sliently. Open files continue to work because client will try renewing caps on later read/write. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	5e3ded1bb6	ceph: pass filp to ceph_get_caps() Also change several other functions' arguments, no logical changes. This is preparetion for later patch that checks filp error. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:24 +02:00
Yan, Zheng	f4b9786622	ceph: track and report error of async metadata operation Use errseq_t to track and report errors of async metadata operations, similar to how kernel handles errors during writeback. If any dirty caps or any unsafe request gets dropped during session eviction, record -EIO in corresponding inode's i_meta_err. The error will be reported by subsequent fsync, Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:23 +02:00
Yan, Zheng	7e6906c1e6	ceph: allow closing session in restarting/reconnect state CEPH_MDS_SESSION_{RESTARTING,RECONNECTING} are for for mds failover, they are sub-states of CEPH_MDS_SESSION_OPEN. So __close_session() should send close request for session in these two state. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:23 +02:00
Jeff Layton	e09580b343	ceph: don't list vxattrs in listxattr() Most filesystems that provide virtual xattrs (e.g. CIFS) don't display them via listxattr(). Ceph does, and that causes some of the tests in xfstests to fail. Have cephfs stop listing vxattrs in listxattr. Userspace can always query them directly when the name is known. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:23 +02:00
Jeff Layton	e1e4460202	ceph: allow copy_file_range when src and dst inode are same There is no reason to prevent this. The OSD should be able to handle this as long as the objects are different, and the existing code falls back when the offset into the object is different. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:23 +02:00
Luis Henriques	750670341a	ceph: fix directories inode i_blkbits initialization When filling an inode with info from the MDS, i_blkbits is being initialized using fl_stripe_unit, which contains the stripe unit in bytes. Unfortunately, this doesn't make sense for directories as they have fl_stripe_unit set to '0'. This means that i_blkbits will be set to 0xff, causing an UBSAN undefined behaviour in i_blocksize(): UBSAN: Undefined behaviour in ./include/linux/fs.h:731:12 shift exponent 255 is too large for 32-bit type 'int' Fix this by initializing i_blkbits to CEPH_BLOCK_SHIFT if fl_stripe_unit is zero. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-09-16 12:06:23 +02:00
Deepa Dinamani	028ca4db0a	fs: ceph: Initialize filesystem timestamp ranges Fill in the appropriate limits to avoid inconsistencies in the vfs cached inode times when timestamps are outside the permitted range. According to the disscussion in https://patchwork.kernel.org/patch/8308691/ we agreed to use unsigned 32 bit timestamps on ceph. Update the limits accordingly. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: zyan@redhat.com Cc: sage@redhat.com Cc: idryomov@gmail.com Cc: ceph-devel@vger.kernel.org	2019-08-30 07:27:19 -07:00
Jeff Layton	28a282616f	ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply When ceph_mdsc_do_request returns an error, we can't assume that the filelock_reply pointer will be set. Only try to fetch fields out of the r_reply_info when it returns success. Cc: stable@vger.kernel.org Reported-by: Hector Martin <hector@marcansoft.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-08-22 10:47:41 +02:00
Erqi Chen	c95f1c5f43	ceph: clear page dirty before invalidate page clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage(). invalidatepage() clears page's private flag, if dirty flag is not cleared, the page may cause BUG_ON failure in ceph_set_page_dirty(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/40862 Signed-off-by: Erqi Chen <chenerqi@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-08-22 10:47:41 +02:00
Luis Henriques	af8a85a417	ceph: fix buffer free while holding i_ceph_lock in fill_inode() Calling ceph_buffer_put() in fill_inode() may result in freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by postponing the call until later, when the lock is released. The following backtrace was triggered by fstests generic/070. BUG: sleeping function called from invalid context at mm/vmalloc.c:2283 in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4 6 locks held by kworker/0:4/3852: #0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0 #1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0 #2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476 #3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476 #4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476 #5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70 CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Workqueue: ceph-msgr ceph_con_workfn Call Trace: dump_stack+0x67/0x90 ___might_sleep.cold+0x9f/0xb1 vfree+0x4b/0x60 ceph_buffer_release+0x1b/0x60 fill_inode.isra.0+0xa9b/0xf70 ceph_fill_trace+0x13b/0xc70 ? dispatch+0x2eb/0x1476 dispatch+0x320/0x1476 ? __mutex_unlock_slowpath+0x4d/0x2a0 ceph_con_workfn+0xc97/0x2ec0 ? process_one_work+0x1b8/0x5f0 process_one_work+0x244/0x5f0 worker_thread+0x4d/0x3e0 kthread+0x105/0x140 ? process_one_work+0x5f0/0x5f0 ? kthread_park+0x90/0x90 ret_from_fork+0x3a/0x50 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-08-22 10:47:41 +02:00
Luis Henriques	12fe3dda7e	ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob() Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by having this function returning the old blob buffer and have the callers of this function freeing it when the lock is released. The following backtrace was triggered by fstests generic/117. BUG: sleeping function called from invalid context at mm/vmalloc.c:2283 in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress 4 locks held by fsstress/649: #0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0 #1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60 #2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60 #3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60 CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x67/0x90 ___might_sleep.cold+0x9f/0xb1 vfree+0x4b/0x60 ceph_buffer_release+0x1b/0x60 __ceph_build_xattrs_blob+0x12b/0x170 __send_cap+0x302/0x540 ? __lock_acquire+0x23c/0x1e40 ? __mark_caps_flushing+0x15c/0x280 ? _raw_spin_unlock+0x24/0x30 ceph_check_caps+0x5f0/0xc60 ceph_flush_dirty_caps+0x7c/0x150 ? __ia32_sys_fdatasync+0x20/0x20 ceph_sync_fs+0x5a/0x130 iterate_supers+0x8f/0xf0 ksys_sync+0x4f/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fc6409ab617 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-08-22 10:47:41 +02:00
Luis Henriques	86968ef215	ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr() Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the i_xattrs.prealloc_blob buffer while holding the i_ceph_lock. This can be fixed by postponing the call until later, when the lock is released. The following backtrace was triggered by fstests generic/117. BUG: sleeping function called from invalid context at mm/vmalloc.c:2283 in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress 3 locks held by fsstress/650: #0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50 #1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0 #2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810 CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x67/0x90 ___might_sleep.cold+0x9f/0xb1 vfree+0x4b/0x60 ceph_buffer_release+0x1b/0x60 __ceph_setxattr+0x2b4/0x810 __vfs_setxattr+0x66/0x80 __vfs_setxattr_noperm+0x59/0xf0 vfs_setxattr+0x81/0xa0 setxattr+0x115/0x230 ? filename_lookup+0xc9/0x140 ? rcu_read_lock_sched_held+0x74/0x80 ? rcu_sync_lockdep_assert+0x2e/0x60 ? __sb_start_write+0x142/0x1a0 ? mnt_want_write+0x20/0x50 path_setxattr+0xba/0xd0 __x64_sys_lsetxattr+0x24/0x30 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7ff23514359a Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-08-22 10:47:41 +02:00
Linus Torvalds	18253e034d	Merge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache and mountpoint updates from Al Viro: "Saner handling of refcounts to mountpoints. Transfer the counting reference from struct mount ->mnt_mountpoint over to struct mountpoint ->m_dentry. That allows us to get rid of the convoluted games with ordering of mount shutdowns. The cost is in teaching shrink_dcache_{parent,for_umount} to cope with mixed-filesystem shrink lists, which we'll also need for the Slab Movable Objects patchset" * 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch the remnants of releasing the mountpoint away from fs_pin get rid of detach_mnt() make struct mountpoint bear the dentry reference to mountpoint, not struct mount Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt() __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore nfs: dget_parent() never returns NULL ceph: don't open-code the check for dead lockref	2019-07-20 09:15:51 -07:00
Linus Torvalds	d9b9c89304	Lots of exciting things this time! - support for rbd object-map and fast-diff features (myself). This will speed up reads, discards and things like snap diffs on sparse images. - ceph.snap.btime vxattr to expose snapshot creation time (David Disseldorp). This will be used to integrate with "Restore Previous Versions" feature added in Windows 7 for folks who reexport ceph through SMB. - security xattrs for ceph (Zheng Yan). Only selinux is supported for now due to the limitations of ->dentry_init_security(). - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff Layton). This is actually a single feature bit which was missing because of the filesystem pieces. With this in, the kernel client will finally be reported as "luminous" by "ceph features" -- it is still being reported as "jewel" even though all required Luminous features were implemented in 4.13. - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention with xattrs is to not terminate and this was causing inconsistencies with ceph-fuse. - change filesystem time granularity from 1 us to 1 ns, again fixing an inconsistency with ceph-fuse (Luis Henriques). On top of this there are some additional dentry name handling and cap flushing fixes from Zheng. Finally, Jeff is formally taking over for Zheng as the filesystem maintainer. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl0u+X8THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi9byB/9fIxzoxtDMvixtJabuGSJRtlijDlWF GlO6yIWCXl/8v8easR2PCF75U/xv0+QFQmze8PVi8u4Xz589P247NnEuyEZ9n84i aCavARho6QLZPEL+B04NaqoHBl+ORKQTA6eKGhyKwRp/rn83z5Ubuw2tN7krHT3b kCY61FuTQGxNY2o/WKv/iLwINYr7H23hCf0WwyyKH1bp7OegiQ14Ebn1NtfS3sMx hS6h8Ya826vmUW0bCSS/9kzKYBCjksTig0HphUOHq6BoZJs++0b7GukIulRyuLfD J9Gr9HGPoDCVzdmFfpn2FSlxdmqfO9amUSagd0ftLQfFlPlrpoULi0GW =Bgxr -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "Lots of exciting things this time! - support for rbd object-map and fast-diff features (myself). This will speed up reads, discards and things like snap diffs on sparse images. - ceph.snap.btime vxattr to expose snapshot creation time (David Disseldorp). This will be used to integrate with "Restore Previous Versions" feature added in Windows 7 for folks who reexport ceph through SMB. - security xattrs for ceph (Zheng Yan). Only selinux is supported for now due to the limitations of ->dentry_init_security(). - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff Layton). This is actually a single feature bit which was missing because of the filesystem pieces. With this in, the kernel client will finally be reported as "luminous" by "ceph features" -- it is still being reported as "jewel" even though all required Luminous features were implemented in 4.13. - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention with xattrs is to not terminate and this was causing inconsistencies with ceph-fuse. - change filesystem time granularity from 1 us to 1 ns, again fixing an inconsistency with ceph-fuse (Luis Henriques). On top of this there are some additional dentry name handling and cap flushing fixes from Zheng. Finally, Jeff is formally taking over for Zheng as the filesystem maintainer" * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits) ceph: fix end offset in truncate_inode_pages_range call ceph: use generic_delete_inode() for ->drop_inode ceph: use ceph_evict_inode to cleanup inode's resource ceph: initialize superblock s_time_gran to 1 MAINTAINERS: take over for Zheng as CephFS kernel client maintainer rbd: setallochint only if object doesn't exist rbd: support for object-map and fast-diff rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() libceph: export osd_req_op_data() macro libceph: change ceph_osdc_call() to take page vector for response libceph: bump CEPH_MSG_MAX_DATA_LEN (again) rbd: new exclusive lock wait/wake code rbd: quiescing lock should wait for image requests rbd: lock should be quiesced on reacquire rbd: introduce copyup state machine rbd: rename rbd_obj_setup_() to rbd_obj_init_() rbd: move OSD request allocation into object request state machines rbd: factor out __rbd_osd_setup_discard_ops() rbd: factor out rbd_osd_setup_copyup() rbd: introduce obj_req->osd_reqs list ...	2019-07-18 11:05:25 -07:00
Linus Torvalds	f632a8170a	Driver Core and debugfs changes for 5.3-rc1 Here is the "big" driver core and debugfs changes for 5.3-rc1 It's a lot of different patches, all across the tree due to some api changes and lots of debugfs cleanups. Because of this, there is going to be some merge issues with your tree at the moment, I'll follow up with the expected resolutions to make it easier for you. Other than the debugfs cleanups, in this set of changes we have: - bus iteration function cleanups (will cause build warnings with s390 and coresight drivers in your tree) - scripts/get_abi.pl tool to display and parse Documentation/ABI entries in a simple way - cleanups to Documenatation/ABI/ entries to make them parse easier due to typos and other minor things - default_attrs use for some ktype users - driver model documentation file conversions to .rst - compressed firmware file loading - deferred probe fixes All of these have been in linux-next for a while, with a bunch of merge issues that Stephen has been patient with me for. Other than the merge issues, functionality is working properly in linux-next :) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K LpRyb3zX29oChFaZkc5a =XrEZ -----END PGP SIGNATURE----- Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the "big" driver core and debugfs changes for 5.3-rc1 It's a lot of different patches, all across the tree due to some api changes and lots of debugfs cleanups. Other than the debugfs cleanups, in this set of changes we have: - bus iteration function cleanups - scripts/get_abi.pl tool to display and parse Documentation/ABI entries in a simple way - cleanups to Documenatation/ABI/ entries to make them parse easier due to typos and other minor things - default_attrs use for some ktype users - driver model documentation file conversions to .rst - compressed firmware file loading - deferred probe fixes All of these have been in linux-next for a while, with a bunch of merge issues that Stephen has been patient with me for" * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits) debugfs: make error message a bit more verbose orangefs: fix build warning from debugfs cleanup patch ubifs: fix build warning after debugfs cleanup patch driver: core: Allow subsystems to continue deferring probe drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT arch_topology: Remove error messages on out-of-memory conditions lib: notifier-error-inject: no need to check return value of debugfs_create functions swiotlb: no need to check return value of debugfs_create functions ceph: no need to check return value of debugfs_create functions sunrpc: no need to check return value of debugfs_create functions ubifs: no need to check return value of debugfs_create functions orangefs: no need to check return value of debugfs_create functions nfsd: no need to check return value of debugfs_create functions lib: 842: no need to check return value of debugfs_create functions debugfs: provide pr_fmt() macro debugfs: log errors when something goes wrong drivers: s390/cio: Fix compilation warning about const qualifiers drivers: Add generic helper to match by of_node driver_find_device: Unify the match function with class_find_device() bus_find_device: Unify the match callback with class_find_device ...	2019-07-12 12:24:03 -07:00
Linus Torvalds	40f06c7995	Changes to copy_file_range for 5.3 from Dave and Amir: - Create a generic copy_file_range handler and make individual filesystems responsible for calling it (i.e. no more assuming that do_splice_direct will work or is appropriate) - Refactor copy_file_range and remap_range parameter checking where they are the same - Install missing copy_file_range parameter checking(!) - Remove suid/sgid and update mtime like any other file write - Change the behavior so that a copy range crossing the source file's eof will result in a short copy to the source file's eof instead of EINVAL - Permit filesystems to decide if they want to handle cross-superblock copy_file_range in their local handlers. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ 924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV 8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw= =qM44 -----END PGP SIGNATURE----- Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull copy_file_range updates from Darrick Wong: "This fixes numerous parameter checking problems and inconsistent behaviors in the new(ish) copy_file_range system call. Now the system call will actually check its range parameters correctly; refuse to copy into files for which the caller does not have sufficient privileges; update mtime and strip setuid like file writes are supposed to do; and allows copying up to the EOF of the source file instead of failing the call like we used to. Summary: - Create a generic copy_file_range handler and make individual filesystems responsible for calling it (i.e. no more assuming that do_splice_direct will work or is appropriate) - Refactor copy_file_range and remap_range parameter checking where they are the same - Install missing copy_file_range parameter checking(!) - Remove suid/sgid and update mtime like any other file write - Change the behavior so that a copy range crossing the source file's eof will result in a short copy to the source file's eof instead of EINVAL - Permit filesystems to decide if they want to handle cross-superblock copy_file_range in their local handlers" * tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fuse: copy_file_range needs to strip setuid bits and update timestamps vfs: allow copy_file_range to copy across devices xfs: use file_modified() helper vfs: introduce file_modified() helper vfs: add missing checks to copy_file_range vfs: remove redundant checks from generic_remap_checks() vfs: introduce generic_file_rw_checks() vfs: no fallback for ->copy_file_range vfs: introduce generic_copy_file_range()	2019-07-10 20:32:37 -07:00
Luis Henriques	d31d07b97a	ceph: fix end offset in truncate_inode_pages_range call Commit `e450f4d1a5` ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") fixed the end offset parameter used to call filemap_write_and_wait_range and invalidate_inode_pages2_range. Unfortunately it missed truncate_inode_pages_range, introducing a regression that is easily detected by xfstest generic/130. The problem is that when doing direct IO it is possible that an extra page is truncated from the page cache when the end offset is page aligned. This can cause data loss if that page hasn't been sync'ed to the OSDs. While there, change code to use PAGE_ALIGN macro instead. Cc: stable@vger.kernel.org Fixes: `e450f4d1a5` ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:45 +02:00
Luis Henriques	52dd0f1b3f	ceph: use generic_delete_inode() for ->drop_inode ceph_drop_inode() implementation is not any different from the generic function, thus there's no point in keeping it around. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:45 +02:00
Yan, Zheng	87bc5b895d	ceph: use ceph_evict_inode to cleanup inode's resource remove_session_caps() relies on __wait_on_freeing_inode(), to wait for freeing inode to remove its caps. But VFS wakes freeing inode waiters before calling destroy_inode(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/40102 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:45 +02:00
Luis Henriques	0f7cf80ae9	ceph: initialize superblock s_time_gran to 1 Having granularity set to 1us results in having inode timestamps with a accurancy different from the fuse client (i.e. atime, ctime and mtime will always end with '000'). This patch normalizes this behaviour and sets the granularity to 1. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:45 +02:00
Ilya Dryomov	94e8577188	libceph: rename r_unsafe_item to r_private_item This list item remained from when we had safe and unsafe replies (commit vs ack). It has since become a private list item for use by clients. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:44 +02:00
Jeff Layton	26350535c2	ceph: don't NULL terminate virtual xattrs The convention with xattrs is to not store the termination with string data, given that it returns the length. This is how setfattr/getfattr operate. Most of ceph's virtual xattr routines use snprintf to plop the string directly into the destination buffer, but snprintf always NULL terminates the string. This means that if we send the kernel a buffer that is the exact length needed to hold the string, it'll end up truncated. Add a ceph_fmt_xattr helper function to format the string into an on-stack buffer that should always be large enough to hold the whole thing and then memcpy the result into the destination buffer. If it does turn out that the formatted string won't fit in the on-stack buffer, then return -E2BIG and do a WARN_ONCE(). Change over most of the virtual xattr routines to use the new helper. A couple of the xattrs are sourced from strings however, and it's difficult to know how long they'll be. Just have those memcpy the result in place after verifying the length. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-07-08 14:01:44 +02:00

1 2 3 4 5 ...

1778 Commits