remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.
We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
To write data, the writer first acquires the i_mutex, then try getting
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
Fb cap in this case, vmtruncate work can't do its job because i_mutex
is locked. We should wake up the writer and let it truncate the pages.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pull Ceph updates from Sage Weil:
"There is some follow-on RBD cleanup after the last window's code drop,
a series from Yan fixing multi-mds behavior in cephfs, and then a
sprinkling of bug fixes all around. Some warnings, sleeping while
atomic, a null dereference, and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits)
libceph: fix invalid unsigned->signed conversion for timespec encoding
libceph: call r_unsafe_callback when unsafe reply is received
ceph: fix race between cap issue and revoke
ceph: fix cap revoke race
ceph: fix pending vmtruncate race
ceph: avoid accessing invalid memory
libceph: Fix NULL pointer dereference in auth client code
ceph: Reconstruct the func ceph_reserve_caps.
ceph: Free mdsc if alloc mdsc->mdsmap failed.
ceph: remove sb_start/end_write in ceph_aio_write.
ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.
ceph: fix sleeping function called from invalid context.
ceph: move inode to proper flushing list when auth MDS changes
rbd: fix a couple warnings
ceph: clear migrate seq when MDS restarts
ceph: check migrate seq before changing auth cap
ceph: fix race between page writeback and truncate
ceph: reset iov_len when discarding cap release messages
ceph: fix cap release race
libceph: fix truncate size calculation
...
The locking order for pending vmtruncate is wrong, it can lead to
following race:
write wmtruncate work
------------------------ ----------------------
lock i_mutex
check i_truncate_pending check i_truncate_pending
truncate_inode_pages() lock i_mutex (blocked)
copy data to page cache
unlock i_mutex
truncate_inode_pages()
The fix is take i_mutex before calling __ceph_do_pending_vmtruncate()
Fixes: http://tracker.ceph.com/issues/5453
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.
This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call safely.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().
This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache. However, we still need to update the inodes with the state
returned by the MDS.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before printing kuid and kgids values convert them into
the initial user namespace.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- In fill_inode() transate uids and gids in the initial user namespace
into kuids and kgids stored in inode->i_uid and inode->i_gid.
- In ceph_setattr() if they have changed convert inode->i_uid and
inode->i_gid into initial user namespace uids and gids for
transmission.
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull Ceph update from Sage Weil:
"There are a few different groups of commits here. The largest is
Alex's ongoing work to enable the coming RBD features (cloning,
striping). There is some cleanup in libceph that goes along with it.
Cyril and David have fixed some problems with NFS reexport (leaking
dentries and page locks), and there is a batch of patches from Yan
fixing problems with the fs client when running against a clustered
MDS. There are a few bug fixes mixed in for good measure, many of
which will be going to the stable trees once they're upstream.
My apologies for the late pull. There is still a gremlin in the rbd
map/unmap code and I was hoping to include the fix for that as well,
but we haven't been able to confirm the fix is correct yet; I'll send
that in a separate pull once it's nailed down."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
rbd: get rid of rbd_{get,put}_dev()
libceph: register request before unregister linger
libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
libceph: init event->node in ceph_osdc_create_event()
libceph: init osd->o_node in create_osd()
libceph: report connection fault with warning
libceph: socket can close in any connection state
rbd: don't use ENOTSUPP
rbd: remove linger unconditionally
rbd: get rid of RBD_MAX_SEG_NAME_LEN
libceph: avoid using freed osd in __kick_osd_requests()
ceph: don't reference req after put
rbd: do not allow remove of mounted-on image
libceph: Unlock unprocessed pages in start_read() error path
ceph: call handle_cap_grant() for cap import message
ceph: Fix __ceph_do_pending_vmtruncate
ceph: Don't add dirty inode to dirty list if caps is in migration
ceph: Fix infinite loop in __wake_requests
ceph: Don't update i_max_size when handling non-auth cap
bdi_register: add __printf verification, fix arg mismatch
...
we should set i_truncate_pending to 0 after page cache is truncated
to i_truncate_size
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.
Also use kstrndup rather than kmalloc/memcpy.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation
Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction. That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.
Changing the list lock ordering would be a painful process.
However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().
Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.
That is, instead of
if (queue_work(...))
ihold();
we need to do
ihold();
if (!queue_work(...))
iput();
Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().
Signed-off-by: Sage Weil <sage@newdream.net>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This reverts commit c9af9fb68e.
We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:
- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data
It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: document unlocked d_parent accesses
ceph: explicitly reference rename old_dentry parent dir in request
ceph: document locking for ceph_set_dentry_offset
ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
ceph: protect d_parent access in ceph_d_revalidate
ceph: protect access to d_parent
ceph: handle racing calls to ceph_init_dentry
ceph: set dir complete frag after adding capability
rbd: set blk_queue request sizes to object size
ceph: set up readahead size when rsize is not passed
rbd: cancel watch request when releasing the device
ceph: ignore lease mask
ceph: fix ceph_lookup_open intent usage
ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
ceph: fix bad parent_inode calc in ceph_lookup_open
ceph: avoid carrying Fw cap during write into page cache
libceph: don't time out osd requests that haven't been received
ceph: report f_bfree based on kb_avail rather than diffing.
ceph: only queue capsnap if caps are dirty
ceph: fix snap writeback when racing with writes
...
d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Curretly ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issue cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.
Signed-off-by: Sage Weil <sage@newdream.net>
We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.
This bug can be reproduced occasionally by running blogbench.
Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The __mark_dirty_inode helper now takes i_lock as of 250df6ed. Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.
Signed-off-by: Sage Weil <sage@newdream.net>
The ino32 mount option forces the ceph fs to report 32 bit
ino values. This is useful for 64 bit kernels with 32 bit userspace.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
d_move puts the renamed dentry at the end of d_subdirs, screwing with our
cached dentry directory offsets. We were just clearing I_COMPLETE to avoid
any possibility of trouble. However, assigning the renamed dentry an
offset at the end of the directory (to match it's new d_subdirs position)
is sufficient to maintain correct behavior and hold onto I_COMPLETE.
This is especially important for workloads like rsync, which renames files
into place. Before, we would lose I_COMPLETE and do MDS lookups for each
file. With this patch we only talk to the MDS on create and rename.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: avoid picking MDS that is not active
ceph: avoid immediate cap check after import
ceph: fix flushing of caps vs cap import
ceph: fix erroneous cap flush to non-auth mds
ceph: fix cap_wanted_delay_{min,max} mount option initialization
ceph: fix xattr rbtree search
ceph: fix getattr on directory when using norbytes
The norbytes mount option was broken, and when doing getattr
on a directory it return the rbytes instead of the number of
entities. This commit fixes it.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix cleanup when trying to mount inexistent image
net/ceph: make ceph_msgr_wq non-reentrant
ceph: fsc->*_wq's aren't used in memory reclaim path
ceph: Always free allocated memory in osdmap_decode()
ceph: Makefile: Remove unnessary code
ceph: associate requests with opening sessions
ceph: drop redundant r_mds field
ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
ceph: add dir_layout to inode
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS. This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.
Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests. This will result in a forward
and degrade performance. It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.
Signed-off-by: Sage Weil <sage@newdream.net>
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.
Signed-off-by: Sage Weil <sage@newdream.net>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>