Pull vfs open-related updates from Al Viro:
- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.
- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().
- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.
- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).
* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...
Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In any case, d_splice_alias() does not drop reference of original
dentry.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself). The rest is mostly
mount option handling cleanups from Chengguang and assorted fixes
from Zheng, Luis and Dongsheng.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJbIkigAAoJEEp/3jgCEfOL3EUH/1s7Ib3FgFzG/SPPKISxZOGr
ndZGg0rPT9mPIQ4rp6t0z/cDlMrluPmCK3sWrAPe//sZz9iZiuip+mCL0gUFXFNr
1kL2xDKkJzGxtP3UlUvr5CC6bnxLdeBXJRBDLk/swtphuqArKndlbN/iLZnCZivT
uJDk+vZTwNJ3UhQP4QdnOQLV60NYs+q4euTqbZF3+pDiRiONbxRfXC3adFsc8zL9
zlie3CHPbrQHWMsfNvbfM3rBH1WhTwEssDm+IEFlKl19q9SKP2WPZfmBcE1pmZ58
AhIMoNGdQha1FXS6N96kaPaqFgeysPnEPoyHDqLxsUMKqsvJlOEZsK1jujza4rE=
=EfXm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main piece is a set of libceph changes that revamps how OSD
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself).
The rest is mostly mount option handling cleanups from Chengguang and
assorted fixes from Zheng, Luis and Dongsheng.
* tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
rbd: flush rbd_dev->watch_dwork after watch is unregistered
ceph: update description of some mount options
ceph: show ino32 if the value is different with default
ceph: strengthen rsize/wsize/readdir_max_bytes validation
ceph: fix alignment of rasize
ceph: fix use-after-free in ceph_statfs()
ceph: prevent i_version from going back
ceph: fix wrong check for the case of updating link count
libceph: allocate the locator string with GFP_NOFAIL
libceph: make abort_on_full a per-osdc setting
libceph: don't abort reads in ceph_osdc_abort_on_full()
libceph: avoid a use-after-free during map check
libceph: don't warn if req->r_abort_on_full is set
libceph: use for_each_request() in ceph_osdc_abort_on_full()
libceph: defer __complete_request() to a workqueue
libceph: move more code into __complete_request()
libceph: no need to call flush_workqueue() before destruction
ceph: flush pending works before shutdown super
ceph: abort osd requests on force umount
libceph: introduce ceph_osdc_abort_requests()
...
In current ceph_show_options(), there is no item for showing 'ino32',
so add showing mount option 'ino32' if the value is different with
default.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The check (intval < PAGE_SIZE) will involve type cast, so even when
specifying negative value to rsize/wsize/readdir_max_bytes, it will
pass the validation check successfully.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
On currently logic:
when I specify rasize=0~1 then it will be 4096.
when I specify rasize=2~4097 then it will be 8192.
Make it the same as rsize & wsize.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
KASAN found an UAF in ceph_statfs. This was a one-off bug but looking at
the code it looks like the monmap access needs to be protected as it can
be modified while we're accessing it. Fix this by protecting the access
with the monc->mutex.
BUG: KASAN: use-after-free in ceph_statfs+0x21d/0x2c0
Read of size 8 at addr ffff88006844f2e0 by task trinity-c5/304
CPU: 0 PID: 304 Comm: trinity-c5 Not tainted 4.17.0-rc6+ #172
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xa5/0x11b
? show_regs_print_info+0x5/0x5
? kmsg_dump_rewind+0x118/0x118
? ceph_statfs+0x21d/0x2c0
print_address_description+0x73/0x2b0
? ceph_statfs+0x21d/0x2c0
kasan_report+0x243/0x360
ceph_statfs+0x21d/0x2c0
? ceph_umount_begin+0x80/0x80
? kmem_cache_alloc+0xdf/0x1a0
statfs_by_dentry+0x79/0xb0
vfs_statfs+0x28/0x110
user_statfs+0x8c/0xe0
? vfs_statfs+0x110/0x110
? __fdget_raw+0x10/0x10
__se_sys_statfs+0x5d/0xa0
? user_statfs+0xe0/0xe0
? mutex_unlock+0x1d/0x40
? __x64_sys_statfs+0x20/0x30
do_syscall_64+0xee/0x290
? syscall_return_slowpath+0x1c0/0x1c0
? page_fault+0x1e/0x30
? syscall_return_slowpath+0x13c/0x1c0
? prepare_exit_to_usermode+0xdb/0x140
? syscall_trace_enter+0x330/0x330
? __put_user_4+0x1c/0x30
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Allocated by task 130:
__kmalloc+0x124/0x210
ceph_monmap_decode+0x1c1/0x400
dispatch+0x113/0xd20
ceph_con_workfn+0xa7e/0x44e0
process_one_work+0x5f0/0xa30
worker_thread+0x184/0xa70
kthread+0x1a0/0x1c0
ret_from_fork+0x35/0x40
Freed by task 130:
kfree+0xb8/0x210
dispatch+0x15a/0xd20
ceph_con_workfn+0xa7e/0x44e0
process_one_work+0x5f0/0xa30
worker_thread+0x184/0xa70
kthread+0x1a0/0x1c0
ret_from_fork+0x35/0x40
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The intent behind making it a per-request setting was that it would be
set for writes, but not for reads. As it is, the flag is set for all
fs/ceph requests except for pool perm check stat request (technically
a read).
ceph_osdc_abort_on_full() skips reads since the previous commit and
I don't see a use case for marking individual requests.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Pending works hold inode references, which cause "Busy inodes after
unmount" warning.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, calling stat on a cephfs directory returns 1 for st_nlink.
This behaviour has recently changed in the fuse client, as some
applications seem to expect this value to be either 0 (if it's
unlinked) or 2 + number of subdirectories. This behaviour was changed
in the fuse client with commit 67c7e4619188 ("client: use common
interp of st_nlink for dirs").
This patch modifies the kernel client to have a similar behaviour.
Link: https://tracker.ceph.com/issues/23873
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Without these new fields, stale st_size is returned in following
case.
1. MDS modifies a directory
2. MDS issues CEPH_CAP_ANY_SHARED to client
3. The client satifies stat(2) by its cached metadata. set st_size
to "i_files + i_subdirs".
Link: http://tracker.ceph.com/issues/23855
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The data structure includes the versioned feilds of cap message.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In MDS, file/subdir counts of a directory inode are protected by
filelock. In request reply without Fs cap, nfiles/nsubdirs can be
stale.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
rstat is not tracked by capability. client can't know if rstat from
non-auth mds is uptodate or not.
Link: http://tracker.ceph.com/issues/23538
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.
Convert these print formats to use long long types to
avoid warnings and errors on conversion.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: zyan@redhat.com
Cc: ceph-devel@vger.kernel.org
dio_get_pagev_size() and dio_get_pages_alloc() introduced in commit
b5b98989dc ("ceph: combine as many iovec as possile into one OSD
request") assume that the passed iov_iter is ITER_IOVEC. This isn't
the case with splice where it ends up poking into the guts of ITER_BVEC
or ITER_PIPE iterators, causing lockups and crashes easily reproduced
with generic/095.
Rather than trying to figure out gap alignment and stuff pages into
a page vector, add a helper for going from iov_iter to a bio_vec array
and make use of the new CEPH_OSD_DATA_TYPE_BVECS code.
Fixes: b5b98989dc ("ceph: combine as many iovec as possile into one OSD request")
Link: http://tracker.ceph.com/issues/18130
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
rsize/wsize cap should be applied before ceph_osdc_new_request() is
called. Otherwise, if the size is limited by the cap instead of the
stripe unit, ceph_osdc_new_request() would setup an extent op that is
bigger than what dio_get_pages_alloc() would pin and add to the page
vector, triggering asserts in the messenger.
Cc: stable@vger.kernel.org
Fixes: 95cca2b44e ("ceph: limit osd write size")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
- support for rbd "fancy" striping (myself). The striping feature bit
is now fully implemented, allowing mapping v2 images with non-default
striping patterns. This completes support for --image-format 2.
- CephFS quota support (Luis Henriques and Zheng Yan). This set is
based on the new SnapRealm code in the upcoming v13.y.z ("Mimic")
release. Quota handling will be rejected on older filesystems.
- memory usage improvements in CephFS (Chengguang Xu). Directory
specific bits have been split out of ceph_file_info and some effort
went into improving cap reservation code to avoid OOM crashes.
Also included a bunch of assorted fixes all over the place from
Chengguang and others.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJazOI/AAoJEEp/3jgCEfOLOu0IAKGFkcCo0UdQDGHHJZHn2rAm
CSWMMwyYGAhoWI6Gva0jx1A2omZLFSeq/MC8dWLL/MNAKt8i/qo8bTsTrwCHMR2Q
D0FsvMWIhkWRS1/FcD1uVDhn0a/DFm5Kfy8kzz3v695TDCt+BYWrCqyHTB/wSdRR
VpO3KdpHQ9h3ojNBRgIniOCNPeQP+QzLXy+P0h0oKbP2Y03mwJlsWG4L6zakkkwT
e2I+RVdlOMUDJ7rZxiXESBr6BuLI4oOkPe8roQGmZPy1Xe17xa9M5iWVNuM6RUhO
Z9bS2aLMhbDyeCPqvzgAnsUtFT0PAQjB5NYw2yqisbHs/wrU5kMOOpcLqz/Ls/s=
=v1I9
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The big ticket items are:
- support for rbd "fancy" striping (myself).
The striping feature bit is now fully implemented, allowing mapping
v2 images with non-default striping patterns. This completes
support for --image-format 2.
- CephFS quota support (Luis Henriques and Zheng Yan).
This set is based on the new SnapRealm code in the upcoming v13.y.z
("Mimic") release. Quota handling will be rejected on older
filesystems.
- memory usage improvements in CephFS (Chengguang Xu).
Directory specific bits have been split out of ceph_file_info and
some effort went into improving cap reservation code to avoid OOM
crashes.
Also included a bunch of assorted fixes all over the place from
Chengguang and others"
* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
ceph: quota: report root dir quota usage in statfs
ceph: quota: add counter for snaprealms with quota
ceph: quota: cache inode pointer in ceph_snap_realm
ceph: fix root quota realm check
ceph: don't check quota for snap inode
ceph: quota: update MDS when max_bytes is approaching
ceph: quota: support for ceph.quota.max_bytes
ceph: quota: don't allow cross-quota renames
ceph: quota: support for ceph.quota.max_files
ceph: quota: add initial infrastructure to support cephfs quotas
rbd: remove VLA usage
rbd: fix spelling mistake: "reregisteration" -> "reregistration"
ceph: rename function drop_leases() to a more descriptive name
ceph: fix invalid point dereference for error case in mdsc destroy
ceph: return proper bool type to caller instead of pointer
ceph: optimize memory usage
ceph: optimize mds session register
libceph, ceph: add __init attribution to init funcitons
ceph: filter out used flags when printing unused open flags
ceph: don't wait on writeback when there is no more dirty pages
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWsdxrvu3V2unywtrAQJVmQ/9Fv8d/Ecdwv5nxVBmN7uA8lOYcHEbZWmd
FhFQE8qYLjKMo9Fy4tPkBbu1l6CVnetaTRE5qwixACJAftrdjABKJAazGR3Uxief
0jMSWScrV1XCeRErPcczHcx52Hefl8f1DQdA3zpoF0ewz7CjyxMxkl67bsYJbNKE
T4ebCu5IJk+5PPwwMM3REKjQbunSXXnzgCLUI2cc0Yf76CTVpx6p+NpxV+2wq0p7
vym83F68qACAEzNH+oozN7IwqjkWyYOnTtCLiMsh4iq30jP6ohtLom6RcRp7QUxM
Z9hxgG3NptypuVBO1jKxaQ6XZGgAasYmppOmJ/SoALv2PKsAbxi372lTR4ikceKq
H4oNTbs5tVmyvu3qFwtLN+vX+GdfaoSUnUG8vTvnCB3tHHtYj7q5QeFE0HaX4QSq
oLANkCOZU8TJsT30pxsCNYiqc5HK9kaLjUQId9K+xq7mM/IuhtNtBQ+ZpqAh5IxB
4bXKYLdeJ1myZrkYTa6gcTqeFax3djCBJ3UvjTnuqRZAaQg079WkG84Kdq1ZjDRp
IQpKQnPX9JGhjW1zqLK1Ay8h+HFPgWR5BBVOaLwImr1mH+ccG0iNIeDjrOc8h6J5
e60XM/x2dIYxpXyFYAkldbAI24aRg1FNzfniG4rSAPecf3SwWrxg/qK7uujLbJHM
fKNA80yifHo=
=ukqs
-----END PGP SIGNATURE-----
Merge tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
"Three patches that fix some of AFS's usage of fscache:
(1) Need to invalidate the cache if a foreign data change is detected
on the server.
(2) Move the vnode ID uniquifier (equivalent to i_generation) from
the auxiliary data to the index key to prevent a race between
file delete and a subsequent file create seeing the same index
key.
(3) Need to retire cookies that correspond to files that we think got
deleted on the server.
Four patches to fix some things in fscache and cachefiles:
(4) Fix a couple of checker warnings.
(5) Correctly indicate to the end-of-operation callback whether an
operation completed or was cancelled.
(6) Add a check for multiple cookie relinquishment.
(7) Fix a path through the asynchronous write that doesn't wake up a
waiter for a page if the cache decides not to write that page,
but discards it instead.
A couple of patches to add tracepoints to fscache and cachefiles:
(8) Add tracepoints for cookie operators, object state machine
execution, cachefiles object management and cachefiles VFS
operations.
(9) Add tracepoints for fscache operation management and page
wrangling.
And then three development patches:
(10) Attach the index key and auxiliary data to the cookie, pass this
information through various fscache-netfs API functions and get
rid of the callbacks to the netfs to get it.
This means that the cache can get at this information, even if
the netfs goes away. It also means that the cache can be lazy in
updating the coherency data.
(11) Pass the object data size through various fscache-netfs API
rather than calling back to the netfs for it, and store the value
in the object.
This makes it easier to correctly resize the object, as the size
is updated on writes to the cache, rather than calling back out
to the netfs.
(12) Maintain a catalogue of allocated cookies. This makes it possible
to catch cookie collision up front rather than down in the bowels
of the cache being run from a service thread from the object
state machine.
This will also make it possible in the future to reconnect to a
cookie that's not gone dead yet because it's waiting for
finalisation of the storage and also make it possible to bring
cookies online if the cache is added after the cookie has been
obtained"
* tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Maintain a catalogue of allocated cookies
fscache: Pass object size in rather than calling back for it
fscache: Attach the index key and aux data to the cookie
fscache: Add more tracepoints
fscache: Add tracepoints
fscache: Fix hanging wait on page discarded by writeback
fscache: Detect multiple relinquishment of a cookie
fscache: Pass the correct cancelled indications to fscache_op_complete()
fscache, cachefiles: Fix checker warnings
afs: Be more aggressive in retiring cached vnodes
afs: Use the vnode ID uniquifier in the cache key not the aux data
afs: Invalidate cache on server data change
Pass the object size in to fscache_acquire_cookie() and
fscache_write_page() rather than the netfs providing a callback by which it
can be received. This makes it easier to update the size of the object
when a new page is written that extends the object.
The current object size is also passed by fscache to the check_aux
function, obviating the need to store it in the aux data.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
Attach copies of the index key and auxiliary data to the fscache cookie so
that:
(1) The callbacks to the netfs for this stuff can be eliminated. This
can simplify things in the cache as the information is still
available, even after the cache has relinquished the cookie.
(2) Simplifies the locking requirements of accessing the information as we
don't have to worry about the netfs object going away on us.
(3) The cache can do lazy updating of the coherency information on disk.
As long as the cache is flushed before reboot/poweroff, there's no
need to update the coherency info on disk every time it changes.
(4) Cookies can be hashed or put in a tree as the index key is easily
available. This allows:
(a) Checks for duplicate cookies can be made at the top fscache layer
rather than down in the bowels of the cache backend.
(b) Caching can be added to a netfs object that has a cookie if the
cache is brought online after the netfs object is allocated.
A certain amount of space is made in the cookie for inline copies of the
data, but if it won't fit there, extra memory will be allocated for it.
The downside of this is that live cache operation requires more memory.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
This commit changes statfs default behaviour when reporting usage
statistics. Instead of using the overall filesystem usage, statfs now
reports the quota for the filesystem root, if ceph.quota.max_bytes has
been set for this inode. If quota hasn't been set, it falls back to the
old statfs behaviour.
A new mount option is also added ('noquotadf') to disable this behaviour.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
By keeping a counter with the number of snaprealms that have quota set
allows to optimize the functions that need to walk throught the realms
hierarchy looking for quotas. Thus, if this counter is zero it's safe to
assume that there are no realms with quota.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Keep a pointer to the inode in struct ceph_snap_realm. This allows to
optimize functions that walk the realms hierarchy (e.g. in quotas).
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When we're reaching the ceph.quota.max_bytes limit, i.e., when writing
more than 1/16th of the space left in a quota realm, update the MDS with
the new file size.
This mirrors the fuse-client approach with commit 122c50315ed1 ("client:
Inform mds file size when approaching quota limit"), in the ceph git tree.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch changes ceph_rename so that -EXDEV is returned if an attempt is
made to mv a file between two different dir trees with different quotas
setup.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds support for the max_files quota. It hooks into all the
ceph functions that add new filesystem objects that need to be checked
against the quota limits. When these limits are hit, -EDQUOT is returned.
Note that we're not checking quotas on ceph_link(). ceph_link doesn't
really create a new inode, and since the MDS doesn't update the directory
statistics when a new (hard) link is created (only with symlinks), they
are not accounted as a new file.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client. Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.
Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.
Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
1. set fsc->mdsc after successfully allocate all necessary memory
in mdsc init.
2. if fsc->mdsc is NULL, just skip destroy operation in mdsc destroy.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In current code, regular file and directory use same struct
ceph_file_info to store fs specific data so the struct has to
include some fields which are only used for directory
(e.g., readdir related info), when having plenty of regular files,
it will lead to memory waste.
This patch introduces dedicated ceph_dir_file_info cache for
readdir related thins. So that regular file does not include those
unused fields anymore.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>