Commit Graph

855 Commits

Author SHA1 Message Date
Josef Bacik 02c24a8218 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2.  For correctness' sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:59 -04:00
Josef Bacik 06222e491e fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly.  In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the proper due diligence is done.  For example
in NFS/CIFS we need to make sure the file size is updated properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:58 -04:00
Al Viro b85fd6bdc9 don't open-code parent_ino() in assorted ->readdir()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:54 -04:00
Al Viro a127e0af59 ceph: LOOKUP_OPEN is set only when it's the last component
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:59 -04:00
Al Viro 8a5e929dd2 don't transliterate lower bits of ->intent.open.flags to FMODE_...
->create() instances are much happier that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:52 -04:00
Al Viro 10556cb21a ->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:24 -04:00
Al Viro 2830ba7f34 ->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:22 -04:00
Al Viro 178ea73521 kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:16 -04:00
Al Viro 1b71fe2efa ceph analog of cifs build_path_from_dentry() race fix
... unfortunately, cifs bug got copied.  Fix is essentially the same.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-16 23:43:58 -04:00
Sage Weil d7f124f129 ceph: fix sync and dio writes across stripe boundaries
We were iterating across stripe boundaries properly, but not moving the
write buffer pointer forward.  This caused us to rewrite the same data
after the break.  Fix by adjusting the data pointer forward, and
recalculating the io and buffer alignment after the break.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13 16:26:22 -07:00
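The bug class fixed in d7f124f129 above can be shown with a small userspace sketch. This is not the ceph code: `striped_write`, the flat `file` array and the `stripe_size` parameter are hypothetical names chosen for illustration. The point is only that both the file position and the source buffer pointer must advance after each stripe break.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: "file" is a flat byte array, and a single write may
 * not cross a stripe boundary, so it is split into per-stripe chunks.
 * Returns the number of bytes written. */
static size_t striped_write(unsigned char *file, size_t file_len,
                            size_t pos, const unsigned char *buf, size_t len,
                            size_t stripe_size)
{
    size_t written = 0;

    assert(stripe_size > 0);
    while (written < len && pos < file_len) {
        /* Bytes left in the stripe that contains pos. */
        size_t chunk = stripe_size - (pos % stripe_size);

        if (chunk > len - written)
            chunk = len - written;
        if (chunk > file_len - pos)
            chunk = file_len - pos;

        /* Copying from buf instead of buf + written here would rewrite
         * the first chunk's data after every stripe break. */
        memcpy(file + pos, buf + written, chunk);

        pos += chunk;
        written += chunk;
    }
    return written;
}
```

A 6-byte write at offset 2 with 4-byte stripes is split into a 2-byte chunk and a 4-byte chunk, and the second chunk continues from the third source byte.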
Sage Weil 773e9b4426 ceph: fix page alignment corrections
 dd if=/dev/urandom of=/mnt/fs_depot/dd10 bs=500 seek=8388 count=1
 dd if=/mnt/fs_depot/dd10 of=/root/dd10out bs=500 skip=8388 count=1

Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-13 16:26:10 -07:00
Sage Weil 0c1f91f271 ceph: unwind canceled flock state
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.

Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:36:45 -07:00
Sage Weil 0e98728fa3 ceph: fix ENOENT logic in striped_read
Getting ENOENT is equivalent to reading 0 bytes.  Make that correction
before setting up the hit_stripe and was_short flags.

Fixes the following case:
 dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
 dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct

Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:16 -07:00
Sage Weil c3cd62839a ceph: fix short sync reads from the OSD
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer.  For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.

Fix by trimming by i_size, and then unconditionally zeroing the trailing
range.

Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:14 -07:00
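The trim-then-zero approach described in c3cd62839a above can be sketched in userspace as follows. The function name `finish_short_read` and its parameters are assumptions made for this example and do not appear in the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: `got` is the number of bytes the backend returned for
 * a request of `want` bytes at offset `pos` in a file of size `i_size`.
 * Returns how many bytes of buf are valid after trimming and zero-filling. */
static size_t finish_short_read(unsigned char *buf, size_t want, size_t got,
                                size_t pos, size_t i_size)
{
    assert(got <= want);

    /* Trim the attempted range so it never extends past end of file. */
    if (pos >= i_size)
        return 0;
    if (want > i_size - pos)
        want = i_size - pos;
    if (got > want)
        got = want;

    /* Unconditionally zero whatever the backend did not fill. */
    memset(buf + got, 0, want - got);
    return want;
}
```

Because the range is trimmed first, a caller that loops until the request is satisfied terminates at end of file instead of retrying forever.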
Sage Weil 70b666c3b4 ceph: use ihold when we already have an inode ref
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock.  This avoids adding new and unnecessary
locking dependencies.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-06-07 21:34:11 -07:00
Sage Weil db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we changed cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty caps to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant).  This is cleaner and more
robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil 45e3d3eeb6 ceph: avoid inode lookup on nfs fh reconnect
If we get the inode from the MDS, we have a reference in req; don't do a
fresh lookup.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:06 -07:00
Sage Weil 3c454cf216 ceph: use LOOKUPINO to make unconnected nfs fh more reliable
If we are unable to locate an inode by ino, ask the MDS using the new
LOOKUPINO command.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:05 -07:00
Sage Weil 9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil 6b4a3b517a ceph: remove useless check
rc is only ever 0 or negative in this method.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil da39822c65 ceph: fix broken comparison in readdir loop
Both off and fi->offset are unsigned, so the difference is always >= 0.
Compare them directly instead of the sign of the difference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
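The pitfall behind da39822c65 above is general C behavior: subtracting two unsigned values wraps around instead of going negative. A minimal sketch, with helper names invented for this example:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Subtracting unsigned values wraps modulo 2^N; the result is never
 * negative, so testing its sign tells you nothing. */
static unsigned int unsigned_diff(unsigned int off, unsigned int start)
{
    return off - start;
}

/* The fix: compare the operands directly. */
static bool is_before(unsigned int off, unsigned int start)
{
    return off < start;
}

/* Subtraction is only meaningful once the ordering has been checked. */
static unsigned int entries_remaining(unsigned int end, unsigned int off)
{
    assert(off <= end);
    return end - off;
}
```

With these helpers, `unsigned_diff(1, 2)` yields `UINT_MAX` rather than -1, while `is_before(1, 2)` gives the intended answer.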
Sage Weil 3540303f87 ceph: fix rare potential cap leak
If we grab new_cap, retake the lock, and find we already have a cap now
for the given mds, release new_cap.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil ae59808301 ceph: use snprintf for dirstat content
We allocate a buffer for rstats if the dirstat option is enabled.  Use
snprintf.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil 1b36698577 libceph: remove unused variable
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:24:17 -07:00
Sage Weil 3b66378034 ceph: take reference on mds request r_unsafe_dir
We put ourselves on an inode list for the parent directory of metadata
operations so that an fsync on the directory will wait for metadata updates
to commit to disk.  We weren't holding a reference to that directory,
however, and under certain workloads (fsstress in this case) the directory
can go away.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:20:07 -07:00
Henry C Chang d3d0720d4a ceph: do not use i_wrbuffer_ref as refcount for Fb cap
We increment i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and the ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:48 -07:00
Henry C Chang a26a185d27 ceph: fix list_add in ceph_put_snap_realm
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:36 -07:00
Henry C Chang 7d8e18a69d ceph: print debug message before put mds session
The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11 10:44:34 -07:00
Sage Weil fca65b4ad7 ceph: do not call __mark_dirty_inode under i_lock
The __mark_dirty_inode helper now takes i_lock as of 250df6ed.  Fix the
one ceph caller that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04 12:56:45 -07:00
Henry C Chang 8c71897be2 ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
We should unlock the page and return -ENOMEM if ceph_osdc_new_request
failed.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:12 -07:00
Sage Weil 3772d26d87 ceph: use ihold() when i_lock is held
See 0444d76ae6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:08 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds 50f3515828 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: Create a new key type "ceph".
  libceph: Get secret from the kernel keys api when mounting with key=NAME.
  ceph: Move secret key parsing earlier.
  libceph: fix null dereference when unregistering linger requests
  ceph: unlock on error in ceph_osdc_start_request()
  ceph: fix possible NULL pointer dereference
  ceph: flush msgr_wq during mds_client shutdown
2011-03-30 09:46:09 -07:00
Tommi Virtanen 8323c3aa74 ceph: Move secret key parsing earlier.
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebrew key management with the
kernel key retention service.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:16 -07:00
Dave Chinner 0444d76ae6 fs: don't use igrab() while holding i_lock
Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph.

If we are already holding the i_lock, we have a reference to the
inode so we can safely use ihold() to gain an extra reference. This
avoids hangs due to lock recursion on the i_lock now that the
inode_lock is gone and igrab() uses the i_lock itself.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Ryan Mallon <ryan@bluewatersys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-29 07:50:34 -07:00
Sage Weil ef550f6f4f ceph: flush msgr_wq during mds_client shutdown
The release method for mds connections uses a backpointer to the
mds_client, so we need to flush the workqueue of any pending work (and
ceph_connection references) prior to freeing the mds_client.  This fixes
an oops easily triggered under UML by

 while true ; do mount ... ; umount ... ; done

Also fix an outdated comment: the flush in ceph_destroy_client only flushes
OSD connections out.  This bug is basically an artifact of the ceph ->
ceph+libceph conversion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-25 13:27:48 -07:00
Sage Weil 147851d2dc ceph: rename dentry_release -> d_release, fix comment
Just for consistency's sake.  Fix obsolete comment too.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:26 -07:00
Henry C Chang 49bcb93236 ceph: add request to the tail of unsafe write list
In sync_write_wait(), we assume that the newest request is at the
tail of unsafe write list. We should maintain the semantics here.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:25 -07:00
Henry C Chang 78a255654f ceph: remove request from unsafe list if it is canceled/timed out
This fixes the list corruption warning like this:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81()
Hardware name: X8DTU
list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130).
Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan]
Pid: 10977, comm: smbd Tainted: G        W  2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1
Call Trace:
[<ffffffff8105753c>] warn_slowpath_common+0x7c/0x94
[<ffffffff810575ab>] warn_slowpath_fmt+0x41/0x43
[<ffffffff812351a3>] __list_add+0x68/0x81
[<ffffffffa014799d>] ceph_aio_write+0x614/0x8a2 [ceph]
[<ffffffff8111d2a0>] do_sync_write+0xe8/0x125
[<ffffffff81075a1f>] ? autoremove_wake_function+0x0/0x39
[<ffffffff811f21ec>] ? selinux_file_permission+0x5c/0xb3
[<ffffffff811e8521>] ? security_file_permission+0x16/0x18
[<ffffffff8111d864>] vfs_write+0xae/0x10b
[<ffffffff8111d91b>] sys_pwrite64+0x5a/0x76
[<ffffffff81012d32>] system_call_fastpath+0x16/0x1b
---[ end trace 08573eb9f07ff6f4 ]---

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:24 -07:00
Sage Weil 80456f8672 ceph: move readahead default to fs/ceph from libceph
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:23 -07:00
Yehuda Sadeh ad1fee96cb ceph: add ino32 mount option
The ino32 mount option forces the ceph fs to report 32 bit
ino values.  This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-03-21 12:24:22 -07:00
Sage Weil 21f3b5f1bb ceph: remove debugfs debug cruft
Whoops!

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:20 -07:00
Sage Weil 09adc80c61 ceph: preserve I_COMPLETE across rename
d_move puts the renamed dentry at the end of d_subdirs, screwing with our
cached dentry directory offsets.  We were just clearing I_COMPLETE to avoid
any possibility of trouble.  However, assigning the renamed dentry an
offset at the end of the directory (to match its new d_subdirs position)
is sufficient to maintain correct behavior and hold onto I_COMPLETE.

This is especially important for workloads like rsync, which renames files
into place.  Before, we would lose I_COMPLETE and do MDS lookups for each
file.  With this patch we only talk to the MDS on create and rename.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-15 09:14:03 -07:00
Al Viro 0eb980e317 ceph: fix d_revalidate oopsen on NFS exports
can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10 03:44:05 -05:00
Sage Weil 455cec0abf ceph: no .snap inside of snapped namespace
Otherwise you can do things like

# mkdir .snap/foo
# cd .snap/foo/.snap
# ls
<badness>

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:25:09 -08:00
Sage Weil 16a8b70a5a ceph: do not clear I_COMPLETE from d_release
First, this was racy anyway: d_release isn't called until well after the
dentry is unhashed.  Second, this runs afoul of the recent dcache change
that clears d_parent prior to calling d_release (949854d0), causing a NULL
pointer dereference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:52 -08:00
Sage Weil b545cc1505 ceph: do not set I_COMPLETE
Do not set the I_COMPLETE flag on directories until we resolve races with
dcache pruning.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:51 -08:00
Sage Weil 9bde178d05 Revert "ceph: keep reference to parent inode on ceph_dentry"
This reverts commit 97d79b403e.

This fails to account for d_parent changes due to rename or disconnected
dentries due to submounts or NFS reexports.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 10:09:50 -08:00
Linus Torvalds 8bd89ca220 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: keep reference to parent inode on ceph_dentry
  ceph: queue cap_snaps once per realm
  libceph: fix socket write error handling
  libceph: fix socket read error handling
2011-02-21 15:01:38 -08:00
Yehuda Sadeh 97d79b403e ceph: keep reference to parent inode on ceph_dentry
When creating a new dentry we now hold a reference to the parent
inode in the ceph_dentry.  This is required due to the new RCU
changes from 949854d0, which set dentry->d_parent to NULL in d_kill before
calling the ->release() callback.  If/when that behavior is changed, we can
revert this hack.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-02-19 19:59:14 -08:00
Sage Weil e8e1ba96b2 ceph: queue cap_snaps once per realm
We were forming a dirty list, and then queueing cap_snaps for each realm
_and_ its children, regardless of whether the children were already in the
dirty list.  This meant we did it twice for some realms.  Which in turn
meant we corrupted mdsc->snap_flush_list when the cap_snap was re-added to
the list it was already on, and could trigger an infinite loop.

We were also using recursion to reach all the children, a no-no when
stack is limited.

Instead, (re)queue any children on the dirty list, avoiding processing
anything twice and avoiding any recursion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-02-04 20:45:58 -08:00
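The approach described in e8e1ba96b2 above (queue children onto the same list, process each realm once, no recursion) can be sketched with a simple worklist. The array-based tree, `MAX_REALMS` and `queue_realms` are hypothetical stand-ins for illustration, not the kernel's snap realm structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REALMS 16

/* parent[i] is the parent realm of realm i, or -1 for a root. Starting from
 * the realms listed in `dirty`, visit each of them and all descendants
 * exactly once. Returns the number of realms processed. */
static size_t queue_realms(const int *parent, size_t n,
                           const int *dirty, size_t ndirty, bool *visited)
{
    int work[MAX_REALMS];
    size_t head = 0, tail = 0, processed = 0;

    assert(n <= MAX_REALMS);
    for (size_t i = 0; i < n; i++)
        visited[i] = false;

    /* Seed the worklist; the visited flag prevents double queueing. */
    for (size_t i = 0; i < ndirty; i++) {
        if (!visited[dirty[i]]) {
            visited[dirty[i]] = true;
            work[tail++] = dirty[i];
        }
    }

    /* Iterate instead of recursing: children join the same list. */
    while (head < tail) {
        int realm = work[head++];

        processed++;
        for (size_t c = 0; c < n; c++) {
            if (parent[c] == realm && !visited[c]) {
                visited[c] = true;
                work[tail++] = (int)c;
            }
        }
    }
    return processed;
}
```

Since every realm is queued at most once, the worklist never holds more than `n` entries and stack usage stays constant regardless of tree depth.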
Linus Torvalds b12ece7d85 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: avoid picking MDS that is not active
  ceph: avoid immediate cap check after import
  ceph: fix flushing of caps vs cap import
  ceph: fix erroneous cap flush to non-auth mds
  ceph: fix cap_wanted_delay_{min,max} mount option initialization
  ceph: fix xattr rbtree search
  ceph: fix getattr on directory when using norbytes
2011-01-28 12:12:58 +10:00
Sage Weil d66bbd441c ceph: avoid picking MDS that is not active
Ignore replication or auth frag data if it indicates an MDS that is not
active.  This can happen if the MDS shuts down and the client has stale
data about the namespace distribution across the MDS cluster.  If that's
the case, fall back to directing the request based on the auth cap (which
should always be accurate).

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:16:37 -08:00
Sage Weil 7e57b81c76 ceph: avoid immediate cap check after import
The NODELAY flag avoids the heuristics that delay cap (issued/wanted)
release.  There's no reason for that after we import a cap, and it kills
whatever benefit we get from those delays.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:26 -08:00
Sage Weil 088b3f5e9e ceph: fix flushing of caps vs cap import
If we are mid-flush and a cap is migrated to another node, we need to
resend the cap flush message to the new MDS, and do so with the original
flush_seq to avoid leaking across a sync boundary.  Previously we didn't
redo the flush (we only flushed newly dirty data), which would cause a
later sync to hang forever.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:25 -08:00
Sage Weil 24be0c4810 ceph: fix erroneous cap flush to non-auth mds
The int flushing is global and not cleared on each iteration of the loop,
which can cause a second flush of caps to any MDSs with ids greater than
the auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:24 -08:00
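The stale-variable pattern described in 24be0c4810 above can be reproduced in a few lines. This is a simplified model under assumed names (`count_flushes`, `auth`, `reset_each_time`); in it, only the session at index `auth` actually has anything to flush.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many sessions would be sent a flush. When reset_each_time is 0
 * the flag set for the auth session leaks into every later iteration. */
static size_t count_flushes(size_t nsessions, size_t auth, int reset_each_time)
{
    int flushing = 0;
    size_t sent = 0;

    assert(auth < nsessions);
    for (size_t i = 0; i < nsessions; i++) {
        if (reset_each_time)
            flushing = 0;       /* the fix: clear per iteration */
        if (i == auth)
            flushing = 1;
        if (flushing)
            sent++;
    }
    return sent;
}
```

With four sessions and the auth at index 1, the unfixed variant counts three flushes (indices 1, 2 and 3) where the fixed variant counts one.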
Sage Weil 50aac4fec5 ceph: fix cap_wanted_delay_{min,max} mount option initialization
These were initialized to 0 instead of the default, fallout from the RBD
refactor in 3d14c5d2b6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:22 -08:00
Sage Weil 17db143fc0 ceph: fix xattr rbtree search
Fix xattr name comparison in rbtree search for strings that share a prefix.
The *name argument is null terminated, but the xattr name is not, so we
need to use strncmp, but that means adjusting for the case where name is
a prefix of xattr->name.

The corresponding case in __set_xattr() already handles this properly
(although in that case *name is also not null terminated).

Reported-by: Sergiy Kibrik <sakib@meta.ua>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:11 -08:00
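The comparison problem described in 17db143fc0 above can be sketched as follows. `cmp_name_to_key` is a hypothetical helper, not the function in the ceph xattr code; it shows one way to compare a NUL-terminated string against a length-delimited one while still ordering a prefix before the longer name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare a NUL-terminated `name` with a stored key of `key_len` bytes that
 * is NOT NUL-terminated. Returns <0, 0 or >0 in the manner of strcmp(). */
static int cmp_name_to_key(const char *name, const char *key, size_t key_len)
{
    size_t name_len, n;
    int c;

    assert(name != NULL && key != NULL);
    name_len = strlen(name);
    n = name_len < key_len ? name_len : key_len;

    /* Never read past key_len bytes of the stored key. */
    c = strncmp(name, key, n);
    if (c != 0)
        return c;

    /* Equal over the shared prefix: the shorter string sorts first. */
    if (name_len < key_len)
        return -1;
    if (name_len > key_len)
        return 1;
    return 0;
}
```

Without the length tie-break, a lookup for "user.foo" would wrongly match a stored "user.foobar".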
Yehuda Sadeh 1c1266bb91 ceph: fix getattr on directory when using norbytes
The norbytes mount option was broken, and when doing getattr
on a directory it returned the rbytes instead of the number of
entities. This commit fixes it.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 15:50:06 -08:00
Linus Torvalds a170315420 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup when trying to mount inexistent image
  net/ceph: make ceph_msgr_wq non-reentrant
  ceph: fsc->*_wq's aren't used in memory reclaim path
  ceph: Always free allocated memory in osdmap_decode()
  ceph: Makefile: Remove unnessary code
  ceph: associate requests with opening sessions
  ceph: drop redundant r_mds field
  ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
  ceph: add dir_layout to inode
2011-01-13 10:25:24 -08:00
Tejun Heo 01e6acc4ea ceph: fsc->*_wq's aren't used in memory reclaim path
fsc->*_wq's aren't depended upon during memory reclaim.  Convert to
alloc_workqueue() w/o WQ_MEM_RECLAIM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Tracey Dent 582c86e690 ceph: Makefile: Remove unnessary code
Remove the if and else conditional because the code is in mainline and there
is no need for it to be there.

Also, changed the Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
 Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil dc69e2e9fc ceph: associate requests with opening sessions
Associate requests with sessions that aren't yet open.  This makes the
debugfs mdsc request list more informative.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil 4af25fdda6 ceph: drop redundant r_mds field
The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil 14303d20f3 ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS.  This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests.  This will result in a forward
and degrade performance.  It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil 6c0f3af72c ceph: add dir_layout to inode
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function.  This is needed
because the old default Linux dcache hash function is extremely weak and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:12 -08:00
Nick Piggin b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin 34286d6662 fs: rcu-walk aware d_revalidate method
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin fb045adb99 fs: dcache reduce branches in lookup path
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin 2fd6b7f507 fs: dcache scale subdirs
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin da5029563a fs: dcache scale d_unhashed
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin b7ab39f631 fs: dcache scale dentry refcount
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Henry C Chang b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil 92cf765237 ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined.  It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:53:48 -08:00
Henry C Chang ab226e21ad ceph: fix direct-io on non-page-aligned buffers
The user buffer may be 512-byte aligned, not page-aligned.  We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 20:46:16 -08:00
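The accounting issue in ab226e21ad above comes down to how many pages a buffer touches when its start address is not page-aligned. A minimal sketch, assuming a 4096-byte page and using the invented names `SKETCH_PAGE_SIZE` and `pages_spanned`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this example */

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    assert(len - 1 <= UINTPTR_MAX - addr);   /* range must not wrap */

    first = addr / SKETCH_PAGE_SIZE;
    last = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return (size_t)(last - first + 1);
}
```

A page-sized buffer that starts 512 bytes into a page spans two pages, which is the case missed when the buffer is assumed to be page-aligned.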
Sage Weil 1cd275f609 ceph: fix ioctl magic
The ioctl magic was inadvertently changed in 571dba52.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 09:45:22 -08:00
Herb Shiu a5b10629ed ceph: Behave better when handling file lock replies.
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu 637ae8d547 ceph: pass lock information by struct file_lock instead of as individual params.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu 25933abdd8 ceph: Handle file locks in replies from the MDS.
Previously the kernel client incorrectly assumed everything was a directory.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:27 -08:00
Sage Weil 884ea89276 ceph: avoid possible null deref in readdir after dir llseek
last may be NULL, but we dereference it in the else branch without
checking.  Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:15:31 -08:00
Linus Torvalds 76db8ac45f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix readdir EOVERFLOW on 32-bit archs
  ceph: fix frag offset for non-leftmost frags
  ceph: fix dangling pointer
  ceph: explicitly specify page alignment in network messages
  ceph: make page alignment explicit in osd interface
  ceph: fix comment, remove extraneous args
  ceph: fix update of ctime from MDS
  ceph: fix version check on racing inode updates
  ceph: fix uid/gid on resent mds requests
  ceph: fix rdcache_gen usage and invalidate
  ceph: re-request max_size if cap auth changes
  ceph: only let auth caps update max_size
  ceph: fix open for write on clustered mds
  ceph: fix bad pointer dereference in ceph_fill_trace
  ceph: fix small seq message skipping
  Revert "ceph: update issue_seq on cap grant"
2010-11-19 15:32:22 -08:00
Sage Weil 3105c19c45 ceph: fix readdir EOVERFLOW on 32-bit archs
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback.  Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.

Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-18 09:15:07 -08:00
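The commit above (3105c19c45) passes a hashed 32-bit inode number to the filler callback instead of the raw 64-bit one. The exact hashing done by ceph_vino_to_ino() is not shown in this log, so the XOR fold below is only an assumed example of how a 64-bit value can be reduced to 32 bits; fold_ino64() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold a 64-bit inode number into 32 bits by
 * XOR-ing the high half into the low half. This is an assumed hash,
 * not necessarily the one used by the ceph client.
 */
static uint32_t fold_ino64(uint64_t ino)
{
	return (uint32_t)(ino & 0xffffffffu) ^ (uint32_t)(ino >> 32);
}
```

Values that already fit in 32 bits are unchanged by such a fold, while larger ones are mapped into the 32-bit range so that a 32-bit consumer has no reason to report EOVERFLOW.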
Arnd Bergmann 451a3c24b0 BKL: remove extraneous #include <smp_lock.h>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17 08:59:32 -08:00
Sage Weil 7b88dadc13 ceph: fix frag offset for non-leftmost frags
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2.  This fixes readdir on
fragmented (large) directories.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 16:48:59 -08:00
Sage Weil a1629c3b24 ceph: fix dangling pointer
Clear fi->last_name when it's freed.  The only caller is rewinddir() (or
equivalent lseek).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-11 15:24:06 -08:00
Sage Weil b7495fc2ff ceph: make page alignment explicit in osd interface
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched.  This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO).  We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:12 -08:00
Sage Weil e98b6fed84 ceph: fix comment, remove extraneous args
The offset/length arguments aren't used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:24:53 -08:00
Sage Weil d8672d64b8 ceph: fix update of ctime from MDS
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:24:34 -08:00
Sage Weil 8bd59e0188 ceph: fix version check on racing inode updates
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have.  The
exception is when an MDS sends unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied.  Fixed and clarified the comment.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 09:23:12 -08:00
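The odd/even case described in 8bd59e0188 above can be sketched as a pair of predicates. This assumes that odd versions mark unstable information and that our own version is rounded down to the last even value before comparing; both function names are hypothetical, and the comparison is reconstructed from the commit message rather than copied from the source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if an incoming inode update should be skipped.
 * Assumption: even versions are stable, odd versions are unstable.
 */
static bool skip_inode_update(uint64_t ours, uint64_t theirs)
{
	return theirs > 0 && (ours & ~(uint64_t)1) >= theirs;
}

/* The older strict comparison, shown only for contrast. */
static bool skip_inode_update_old(uint64_t ours, uint64_t theirs)
{
	return theirs > 0 && (ours & ~(uint64_t)1) > theirs;
}
```

With ours = 3 and theirs = 2, the strict form does not skip, so the stale v2 info would be applied; the non-strict form skips it. With ours = 3 and theirs = 3 (unstable info), neither form skips, so the update is applied.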
Sage Weil cb4276cca4 ceph: fix uid/gid on resent mds requests
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid.  Put that information in the
request struct on request setup.

This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil cd045cb42a ceph: fix rdcache_gen usage and invalidate
We used to use rdcache_gen to indicate whether we "might" have cached
pages.  Now we just look at the mapping to determine that.  However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages.  That can happen
at any time (presumably when we carry FILE_CACHE).  We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate.  If they are
equal, we should invalidate.  On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen.  Similarly, if we succeed
in doing a sync invalidate, set revoking = gen - 1.  (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Sage Weil feb4cc9bb4 ceph: re-request max_size if cap auth changes
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests.  This fixes potential
hangs on writes that extend a file and race with a cap migration between
MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:23 -08:00
Sage Weil 912a9b0319 ceph: only let auth caps update max_size
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap.  Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:39:21 -08:00
Sage Weil 7421ab8041 ceph: fix open for write on clustered mds
Normally when we open a file we already have a cap, and simply update the
wanted set.  However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS.  Only
reuse existing caps if we are opening for read or the existing cap is auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 09:07:15 -08:00
Sage Weil d8b16b3d1c ceph: fix bad pointer dereference in ceph_fill_trace
We dereference *in a few lines down, but only set it on rename.  It is
apparently pretty rare for this to trigger, but I have been hitting it
with clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-07 08:40:43 -08:00
Al Viro a7f9fb205a convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:18 -04:00
Sage Weil 2f56f56ad9 Revert "ceph: update issue_seq on cap grant"
This reverts commit d91f2438d8.

The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths.  By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.

The larger question is what workload/problem made me think it should be
updated here...

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-27 21:05:54 -07:00
Wu Fengguang 1b430beee5 writeback: remove nonblocking/encountered_congestion references
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks).  There are
no behavior changes except for the removal of two entries from one of the
ext4 tracing interfaces.

The nonblocking checks in ->writepages are no longer used because the
flusher now prefers to block on get_request_wait() rather than skip inodes on
IO congestion.  The latter will lead to more seeky IO.

The nonblocking checks in ->writepage are no longer used because they are
redundant with the WB_SYNC_NONE check.

We no longer set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) its old semantics of "Don't get stuck on request queues" is misbehavior:
   that would skip some dirty inodes on congestion and page out others, which
   is unfair in terms of LRU age.

Inspired by Christoph Hellwig. Thanks!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:05 -07:00
Sage Weil efa4c1206e ceph: do not carry i_lock for readdir from dcache
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicating the vfs locking
scalability work.  Since we don't actually need it here anyway, remove
it.

We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:27 -07:00
Julia Lawall 61413c2f59 fs/ceph/xattr.c: Use kmemdup
Convert a sequence of kmalloc and memcpy to use kmemdup.

The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@

  a =
-  \(kmalloc\|kzalloc\)(len,flag)
+  kmemdup(arg,len,flag)
  <... when != a
  if (a == NULL || ...) S
  ...>
- memcpy(a,arg,len+1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:26 -07:00
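For readers unfamiliar with the transformation in 61413c2f59 above, this is a userspace sketch of what a kmemdup-style helper does: it replaces an allocate-then-copy sequence with a single call. memdup_sketch() is a hypothetical stand-in built on malloc() and memcpy(); the kernel's kmemdup() additionally takes a gfp flags argument.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Userspace stand-in for a kmemdup-style helper: allocate len bytes
 * and copy src into them. Returns NULL if the allocation fails.
 */
static void *memdup_sketch(const void *src, size_t len)
{
	void *dst = malloc(len);

	if (dst == NULL)
		return NULL;
	memcpy(dst, src, len);
	return dst;
}
```

The caller still checks for NULL exactly as it would after a bare allocation, but can no longer get the copy length out of step with the allocation length.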
Greg Farnum 571dba52a3 ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:23 -07:00
Randy Dunlap 6f453ed6c0 ceph: fix debugfs warnings
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:

fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:38:21 -07:00
Sage Weil 496e59553c ceph: switch from BKL to lock_flocks()
Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:18 -07:00
Greg Farnum fca4451acf ceph: preallocate flock state without locks held
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held.  Preallocate space without
the lock, and retry if the lock state changes out from underneath us.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:17 -07:00
Sage Weil 18a38193ef ceph: use mapping->nrpages to determine if mapping is empty
This is simpler and faster.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil 93afd449aa ceph: only invalidate on check_caps if we actually have pages
The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:15 -07:00
Sage Weil 4c32f5dda5 ceph: do not hide .snap in root directory
Snaps in the root directory are now supported by the MDS, and harmless on
older versions.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:38:14 -07:00
Yehuda Sadeh 3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00
Yehuda Sadeh ae1533b62b ceph-rbd: osdc support for osd call and rollback operations
This will be used for rbd snapshots administration.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:37:25 -07:00
Yehuda Sadeh 68b4476b0b ceph: messenger and osdc changes for rbd
Allow the messenger to send/receive data in a bio.  This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.

We can now have trailing variable sized data for osd
ops.  Also osd ops encoding is more modular.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:18 -07:00
Yehuda Sadeh 3499e8a5d4 ceph: refactor osdc requests creation functions
The osd request creation is being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:36:01 -07:00
Yehuda Sadeh 7669a2c95e ceph: lookup pool in osdmap by name
Implement a pool lookup by name.  This will be used by rbd.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:35:36 -07:00
Sage Weil d91f2438d8 ceph: update issue_seq on cap grant
We need to update the issue_seq on any grant operation, be it via an MDS
reply or a separate grant message.  The update in the grant path was
missing.  This broke cap release for inodes in which the MDS sent an
explicit grant message that was not soon after followed by a successful
MDS reply on the same inode.

Also fix the signedness on seq locals.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:01:50 -07:00
Greg Farnum 21b559de56 ceph: send cap release message early on failed revoke.
If an MDS tries to revoke caps that we don't have, we want to send
releases early since they probably contain the caps message the MDS
is looking for.

Previously, we only sent the messages if we didn't have the inode either. But
in a multi-mds system we can retain the inode after dropping all caps for
a single MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V bba0cd0e3d ceph: Update max_len with minimum required size
encode_fh on error should update max_len with minimum required
size, so that caller can redo the call with the reallocated buffer.
This is required by the open-by-handle patch series.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V 92923dcbfc ceph: Fix return value of encode_fh function
encode_fh function should return 255 on error as done by other file
systems to indicate EOVERFLOW. Also max_len is in sizeof(u32) units
and not in bytes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
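The two encode_fh commits above (bba0cd0e3d and 92923dcbfc) describe a calling convention: return 255 when the handle does not fit, count max_len in 32-bit words rather than bytes, and report the required size back through max_len. A minimal sketch of that convention follows. The handle layout (a single 64-bit inode number), the handle type value 1, and the name encode_fh_sketch() are assumptions made for this example.

```c
#include <stdint.h>

#define FH_ERR		255	/* returned when the handle does not fit */
#define FH_TYPE_SKETCH	1	/* hypothetical handle type */
#define FH_WORDS	2	/* assumed layout: one 64-bit ino = two u32 words */

/*
 * *max_len is counted in 32-bit words, not bytes. If the buffer is too
 * small, store the required word count in *max_len so the caller can
 * reallocate and retry, then return FH_ERR.
 */
static int encode_fh_sketch(uint64_t ino, uint32_t *fh, int *max_len)
{
	if (*max_len < FH_WORDS) {
		*max_len = FH_WORDS;
		return FH_ERR;
	}
	fh[0] = (uint32_t)(ino & 0xffffffffu);
	fh[1] = (uint32_t)(ino >> 32);
	*max_len = FH_WORDS;
	return FH_TYPE_SKETCH;
}
```

A caller that passes too small a buffer gets 255 back along with the size to retry with, instead of having to guess.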
Sage Weil 6bc18876ba ceph: avoid null deref in osd request error path
If we interrupt an osd request, we call __cancel_request, but it wasn't
verifying that req->r_osd was non-NULL before dereferencing it.  This could
cause a crash if osds were flapping and we aborted a request on said osd.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Henry C Chang 936aeb5c4a ceph: fix list_add usage on unsafe_writes list
Fix argument order.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Sage Weil be4f104dfd ceph: select CRYPTO
We select CRYPTO_AES, but not CRYPTO.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 12:30:31 -07:00
Sage Weil a43fb73101 ceph: check mapping to determine if FILE_CACHE cap is used
See if the i_data mapping has any pages to determine if the FILE_CACHE
capability is currently in use, instead of assuming it is any time the
rdcache_gen value is set (i.e., issued -> used).

This allows the MDS RECALL_STATE process to work for inodes that have cached
pages.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 09:54:31 -07:00
Sage Weil e835124c2b ceph: only send one flushsnap per cap_snap per mds session
Sending multiple flushsnap messages is problematic because we ignore
the response if the tid doesn't match, and the server may only respond to
each one once.  It's also a waste.

So, skip cap_snaps that are already on the flushing list, unless the caller
tells us to resend (because we are reconnecting).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-17 08:03:08 -07:00
Sage Weil ae00d4f37f ceph: fix cap_snap and realm split
The cap_snap creation/queueing relies on both the current i_head_snapc
_and_ the i_snap_realm pointers being correct, so that the new cap_snap
can properly reference the old context and the new i_head_snapc can be
updated to reference the new snaprealm's context.  To fix this, we:

 - move inodes completely to the new (split) realm so that i_snap_realm
   is correct, and
 - generate the new snapc's _before_ queueing the cap_snaps in
   ceph_update_snap_trace().

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-16 16:26:51 -07:00
Sage Weil cfc0bf6640 ceph: stop sending FLUSHSNAPs when we hit a dirty capsnap
Stop sending FLUSHSNAP messages when we hit a capsnap that has dirty_pages
or is still writing.  We'll send the newer capsnaps only after the older
ones complete.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-14 15:50:59 -07:00
Sage Weil 8bef9239ee ceph: correctly set 'follows' in flushsnap messages
The 'follows' should match the seq for the snap context for the given snap
cap, which is the context under which we have been dirtying and writing
data and metadata.  The snapshot that _contains_ those updates thus
_follows_ that context's seq #.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-14 15:45:44 -07:00
Sage Weil 467c525109 ceph: fix dn offset during readdir_prepopulate
When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbering our just-set offset.  This can cause the readdir result offsets
to get out of sync with the server.  Add an argument to the helper so
that it does not.

This bug was introduced by 1cd3935bed.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-13 11:40:36 -07:00
Sage Weil a77d9f7dce ceph: fix file offset wrapping at 4GB on 32-bit archs
Cast the value before shifting so that we don't run out of bits with a
32-bit unsigned long.  This fixes wrapping of high file offsets into the
low 4GB of a file on disk, and the subsequent data corruption for large
files.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:55:25 -07:00
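The wrap described in a77d9f7dce above comes from shifting a 32-bit value before widening it. The sketch below reproduces the effect in userspace, using uint32_t to stand in for a 32-bit unsigned long and an assumed page shift of 12; the function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Buggy form: the shift is done in 32 bits, so the high bits are lost. */
static uint64_t page_offset_truncated(uint32_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/* Fixed form: widen to 64 bits first, then shift. */
static uint64_t page_offset_widened(uint32_t index)
{
	return (uint64_t)index << SKETCH_PAGE_SHIFT;
}
```

For page index 0x100000 (the page at the 4GB boundary), the truncated form yields offset 0 while the widened form yields 0x100000000.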
Sage Weil 3612abbd5d ceph: fix reconnect encoding for old servers
Fix the reconnect encoding to encode the cap record when the MDS does not
have the FLOCK capability (i.e., pre v0.22).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Yehuda Sadeh 3d4401d9d0 ceph: fix pagelist kunmap tail
A wrong parameter was passed to the kunmap.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Sage Weil ca04d9c3ec ceph: fix null pointer deref on anon root dentry release
When we release a root dentry, particularly after a splice, the parent
(actually our) inode was evaluating to NULL and was getting dereferenced
by ceph_snap().  This is reproduced by something as simple as

 mount -t ceph monhost:/a/b mnt
 mount -t ceph monhost:/a mnt2
 ls mnt2

A splice_dentry() would kill the old 'b' inode's root dentry, and we'd
crash while releasing it.

Fix by checking for both the ROOT and NULL cases explicitly.  We only need
to invalidate the parent dir when we have a correct parent to invalidate.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-09-11 10:52:47 -07:00
Dan Carpenter b545787dbb ceph: fix get_ticket_handler() error handling
get_ticket_handler() returns a valid pointer or it returns
ERR_PTR(-ENOMEM) if kzalloc() fails.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:26:50 -07:00
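Several commits in this part of the log (b545787dbb above, and f44c3890d9 and ac1f12ef56 below) correct callers that tested for NULL when the callee returns an ERR_PTR. The following userspace sketch imitates that convention, in which a small negative errno is encoded in the pointer value itself. The helper names and the 4095 limit are assumptions modelled on the kernel's convention, and get_handler_sketch() is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed upper bound on encoded errno values */

/* Encode a negative errno as a pointer value. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno from an error pointer. */
static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the top SKETCH_MAX_ERRNO values of the address space. */
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical allocator: returns a valid pointer or err_ptr(-ENOMEM). */
static void *get_handler_sketch(int simulate_failure)
{
	void *h = simulate_failure ? NULL : calloc(1, 16);

	if (h == NULL)
		return err_ptr(-ENOMEM);
	return h;
}
```

An error pointer is not NULL, so a caller that only checks for NULL would go on to dereference it; the check has to be is_err().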
Sage Weil e072f8aa35 ceph: don't BUG on ENOMEM during mds reconnect
We are in a position to return an error; do that instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:26:37 -07:00
Dan Carpenter f44c3890d9 ceph: ceph_mdsc_build_path() returns an ERR_PTR
ceph_mdsc_build_path() returns an ERR_PTR but this code is set up to
handle NULL returns.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-26 09:24:28 -07:00
Alan Cox ad8453ab0a ceph: Fix warnings
Just scrubbing some warnings so I can see real problem ones in the build
noise. For 32bit we need to coax gcc politely into believing we really
honestly intend to do the casts. Using (u64)(unsigned long) means we cast from
a pointer to a type of the right size and then extend it. This stops the
warning spew.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-25 12:02:14 -07:00
Dan Carpenter ac1f12ef56 ceph: ceph_get_inode() returns an ERR_PTR
ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-25 12:01:54 -07:00
Sage Weil 36e21687e6 ceph: initialize fields on new dentry_infos
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-24 16:24:19 -07:00
Sage Weil 7d8cb26d7d ceph: maintain i_head_snapc when any caps are dirty, not just for data
We used to use i_head_snapc to keep track of which snapc the current epoch
of dirty data was dirtied under.  It is used by queue_cap_snap to set up
the cap_snap.  However, since we queue cap snaps for any dirty caps, not
just for dirty file data, we need to keep a valid i_head_snapc anytime
we have dirty|flushing caps.  This fixes a NULL pointer deref in
queue_cap_snap when writing back dirty caps without data (e.g.,
snaptest-authwb.sh).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-24 16:24:18 -07:00
Henry C Chang 07a27e226d ceph: fix osd request lru adjustment when sending request
Fix argument order.  We want to move the item to the end of the list, not
change the position of the head.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 21:34:27 -07:00
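The argument-order fix in 07a27e226d above is easier to see with a small circular doubly linked list. The sketch below is a userspace imitation with hypothetical names; it follows the (item, head) argument order that the commit message implies, where the first argument is the entry being moved and the second is the list head.

```c
struct list_node {
	struct list_node *prev, *next;
};

/* Make head an empty circular list. */
static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* Unlink n from whatever list it is on. */
static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Insert item just before head, i.e. at the tail of the list. */
static void list_add_tail_node(struct list_node *item, struct list_node *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Move item to the tail. Argument order is (item, head), not (head, item). */
static void list_move_tail_node(struct list_node *item, struct list_node *head)
{
	list_del_node(item);
	list_add_tail_node(item, head);
}
```

Swapping the two arguments would unlink the head and re-insert it before the item, which changes where the list starts instead of moving the item to the end.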
Sage Weil 124514918b ceph: don't improperly set dir complete when holding EXCL cap
If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty.  If we do, we can get bad results from lookup (bad ENOENT) and
bad readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 21:33:32 -07:00
Michael Rubin 679ceace84 mm: exporting account_page_dirty
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Signed-off-by: Michael Rubin <mrubin@google.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:51 -07:00
Sage Weil eb6bb1c5bd ceph: direct requests in snapped namespace based on nonsnap parent
When making a request in the virtual snapdir or a snapped portion of the
namespace, we should choose the MDS based on the first nonsnap parent (and
its caps).  If that is not the best place, we will get forward hints to
find the right MDS in the cluster.  This fixes ESTALE errors when using
the .snap directory and namespace with multiple MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:48 -07:00
Sage Weil ed32604448 ceph: queue cap snap writeback for realm children on snap update
When a realm is updated, we need to queue writeback on inodes in that
realm _and_ its children.  Otherwise, if the inode gets cowed on the
server, we can get a hang later due to out-of-sync cap/snap state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:47 -07:00
Sage Weil 4a625be472 ceph: include dirty xattrs state in snapped caps
When we snapshot dirty metadata that needs to be written back to the MDS,
include dirty xattr metadata.  Make the capsnap reference the encoded
xattr blob so that it will be written back in the FLUSHSNAP op.

Also fix the capsnap creation guard to include dirty auth or file bits,
not just tests specific to dirty file data or file writes in progress
(this fixes auth metadata writeback).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:46 -07:00
Sage Weil 082afec92d ceph: fix xattr cap writeback
We should include the xattr metadata blob in the cap update message any
time we are flushing dirty state, NOT just when we are also dropping the
cap.  This fixes async xattr writeback.

Also, clean up the code slightly to avoid duplicating the bit test.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:16:41 -07:00
Sage Weil f3c60c5918 ceph: fix multiple mds session shutdown
The use of a completion when waiting for session shutdown during umount is
inappropriate, given the complexity of the condition.  For multiple MDS's,
this resulted in the umount thread spinning, often preventing the session
close message from being processed.

Switch to a waitqueue and define a condition helper.  This cleans things
up nicely.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:04:43 -07:00
Yehuda Sadeh e56fa10e92 ceph: generalize mon requests, add pool op support
Generalize the current statfs synchronous requests, and support pool_ops.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-10 14:41:25 -07:00
Sage Weil 0eb6cd49f6 ceph: only queue async writeback on cap revocation if there is dirty data
Normally, if the Fb cap bit is being revoked, we queue an async writeback.
If there is no dirty data but we still hold the cap, this leaves the
client sitting around doing nothing until the cap timeouts expire and the
cap is released on its own (as it would have been without the revocation).

Instead, only queue writeback if the bit is actually used (i.e., we have
dirty data).  If not, we can reply to the revocation immediately.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-05 13:53:40 -07:00
Sage Weil e9d1774431 ceph: do not ignore osd_idle_ttl mount option
Actually apply the mount option to the mount_args struct.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 12:56:57 -07:00
Sage Weil 52dfb8ac0e ceph: constify dentry_operations
This makes checkpatch happy.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:30 -07:00
Sage Weil 213c99ee0c ceph: whitespace cleanup
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:11 -07:00
Greg Farnum 40819f6fb2 ceph: add flock/fcntl lock support
Implement flock inode operation to support advisory file locking.  All
lock/unlock operations are synchronous with the MDS.  Lock state is
sent when reconnecting to a recovering MDS to restore the shared lock
state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 16:10:53 -07:00
Greg Farnum fbaad9797a ceph: define on-wire types, constants for file locking support
Define the MDS operations and data types for doing file advisory locking
with the MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:54 -07:00
Greg Farnum c6f3fdc592 ceph: add CEPH_FEATURE_FLOCK to the supported feature bits
This informs the server that we will accept v2 client_caps format and v2
client_reconnect format messages.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:51 -07:00
Sage Weil 20cb34ae9e ceph: support v2 reconnect encoding
Encode either old or v2 encoding of client_reconnect message, depending on
whether the peer has the FLOCK feature bit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:50 -07:00
Sage Weil ce1fbc8dd6 ceph: support v2 client_caps encoding
Add support for v2 encoding of MClientCaps, which includes a flock blob.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:49 -07:00
Sage Weil cbbfe49905 ceph: move AES iv definition to shared header
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:31 -07:00
Sage Weil 73a7e693f9 ceph: fix decoding of pool snap info
The pool info contains a vector for snap_info_t, not snap ids.  This fixes
the broken decoding, which would declare the update corrupt when a pool
snapshot was created.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 11:10:07 -07:00
Sage Weil 2d9c98ae97 ceph: make ->sync_fs not wait if wait==0
The ->sync_fs() super op only needs to wait if wait is true.  Otherwise,
just get some dirty cap writeback started.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil b8cd07e78e ceph: warn on missing snap realm
Well, this Shouldn't Happen, so it would be helpful to know the caller when
it does.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil effcb9ed43 ceph: print useful error message when crush rule not found
Include the crush_ruleset in the error message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil a8b763a9b3 ceph: use %pU to print uuid (fsid)
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil f0b18d9f22 ceph: sync header defs with server code
Define ROLLBACK op, IFLOCK inode lock (for advisory file locking).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil 5cd068c200 ceph: clean up header guards
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil 9688f19a18 ceph: strip misleading/obsolete version, feature info
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil 6a2593823a ceph: specify supported features in super.h
Specify the supported/required feature bits in super.h client code instead
of using the definitions from the shared kernel/userspace headers (which
will go away shortly).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil c309f0ab26 ceph: clean up fsid mount option
Specify the fsid mount option in hex, not via the major/minor u64 hackery we had
before.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil e0f9f9ee8f ceph: remove unused 'monport' mount option
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum e55b71f802 ceph: handle ESTALE properly; on receipt send to authority if it wasn't
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum 2bc50259fa ceph: add ceph_get_cap_for_mds function.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil 154f42c2c3 ceph: connect to export targets on cap export
When we get a cap EXPORT message, make sure we are connected to all export
targets to ensure we can handle the matching IMPORT.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil cb170a2215 ceph: connect to export targets if mds is laggy
If an MDS we are talking to may have failed, we need to open sessions to
its potential export targets to ensure that any in-progress migration that
may have involved some of our caps is properly handled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil ed0552a1a2 ceph: introduce helper to connect to mds export targets
There are a few cases where we need to open sessions with a given mds's
potential export targets.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil 796d6955a5 ceph: only set num_pages in calc_layout
Setting it elsewhere is unnecessary and more fragile.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh 37151668ba ceph: do caps accounting per mds_client
Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil 0deb01c999 ceph: track laggy state of mds from mdsmap
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh cd84db6e40 ceph: code cleanup
Mainly fixing minor issues reported by sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil ca81f3f6bd ceph: skip if no auth cap in flush_snaps
If we have a capsnap but no auth cap (e.g. because it is migrating to
another mds), bail out and do nothing for now.  Do NOT remove the capsnap
from the flush list.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 3b454c4945 ceph: simplify caps revocation, fix for multimds
The caps revocation should either initiate writeback, invalidation, or
call check_caps to ack or do the dirty work.  The primary question is
whether we can get away with only checking the auth cap or whether all
caps need to be checked.

The old code was doing...something else.  At the very least, revocations
from non-auth MDSs could break by triggering the "check auth cap only"
case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 38e8883ee3 ceph: simplify add_cap_releases
No functional change, aside from more useful debug output.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil ee6b272b9c ceph: drop unused argument
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 2962507ca2 ceph: perform lazy reads when file mode and caps permit
If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it.  Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 33caad324b ceph: perform lazy writes when file mode and caps permit
If we have marked a file as "lazy" (using the ceph ioctl), perform buffered
writes when the MDS caps allow it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 8c6e9229fc ceph: add LAZYIO ioctl to mark a file description for lazy consistency
Allow an application to mark a file descriptor for lazy file consistency
semantics, allowing buffered reads and writes when multiple clients are
accessing the same file.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil 84d9509234 ceph: request FILE_LAZYIO cap when LAZY file mode is set
Also clean up the file flags -> file mode -> wanted caps functions while
we're at it.  This resyncs this file with userspace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:38 -07:00
Yehuda Sadeh 03066f2345 ceph: use complete_all and wake_up_all
This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-27 13:11:17 -07:00
Robert P. J. Day 25848b3ec6 ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-24 21:36:07 -07:00
Sage Weil 1dadcce358 ceph: fix dentry lease release
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it.  Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 13:54:21 -07:00
Sage Weil 8c696737aa ceph: fix leak of dentry in ceph_init_dentry() error path
If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:07 -07:00
Sage Weil bc4fdca857 ceph: fix pg_mapping leak on pg_temp updates
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree.  Also fix a leak in the __insert_pg_mapping() error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:06 -07:00
Sage Weil 252af52146 ceph: fix d_release dop for snapdir, snapped dentries
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released.  We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache.  Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:05 -07:00
Sage Weil a0dff78dab ceph: avoid dcache readdir for snapdir
We should always go to the MDS for readdir on the hidden snapdir.  The
set of snapshots can change at any time; the client can't trust its cache
for that.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22 13:50:45 -07:00
Sage Weil e979cf5039 ceph: do not include cap/dentry releases in replayed messages
Strip the cap and dentry releases from replayed messages.  They can
cause the shared state to get out of sync because they were generated
(with the request message) earlier, and no longer reflect the current
client state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:18 -07:00
Sage Weil 01a92f174f ceph: reuse request message when replaying against recovering mds
Replayed rename operations (after an mds failure/recovery) were broken
because the request paths were regenerated from the dentry names, which
get mangled when d_move() is called.

Instead, resend the previous request message when replaying completed
operations.  Just make sure the REPLAY flag is set and the target ino is
filled in.

This fixes problems with workloads doing renames when the MDS restarts,
where the rename operation appears to succeed but then fails after the
mds restarts (leading to client confusion, app breakage, etc.).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:17 -07:00
Sage Weil f91d3471cc ceph: fix creation of ipv6 sockets
Use the address family from the peer address instead of assuming IPv4.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:20 -07:00
Sage Weil 39139f64e1 ceph: fix parsing of ipv6 addresses
Check for brackets around the ipv6 address to avoid ambiguity with the port
number.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:18 -07:00
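The bracket check described in 39139f64e1 above can be sketched in userspace C. This is only an illustration, not the kernel's parser: the helper name split_host_port() and its return conventions are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel function): split "host:port" or
 * "[v6addr]:port" into host and port.  Returns 0 on success, -1 on a
 * malformed string.  *port is set to -1 when no port is given.
 */
int split_host_port(const char *s, char *host, size_t hostlen, int *port)
{
	const char *end, *rest;
	size_t len;

	assert(s && host && port);
	if (*s == '[') {
		/* Bracketed form: everything up to ']' is the address. */
		end = strchr(s, ']');
		if (!end)
			return -1;
		s++;
		len = (size_t)(end - s);
		rest = end + 1;
	} else {
		/* Unbracketed form: the first ':' (if any) starts the port. */
		end = strchr(s, ':');
		len = end ? (size_t)(end - s) : strlen(s);
		rest = s + len;
	}
	if (len == 0 || len >= hostlen)
		return -1;
	memcpy(host, s, len);
	host[len] = '\0';

	*port = -1;
	if (*rest == ':') {
		char *stop;
		long v = strtol(rest + 1, &stop, 10);

		if (stop == rest + 1 || *stop != '\0' || v < 0 || v > 65535)
			return -1;
		*port = (int)v;
	} else if (*rest != '\0') {
		return -1;
	}
	return 0;
}
```

Without the brackets, a string such as "::1:6789" cannot be split reliably, because the colons inside the address look the same as the one before the port.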
Sage Weil d06dbaf6c2 ceph: fix printing of ipv6 addrs
The buffer was too small.  Make it bigger, use snprintf(), put brackets
around the ipv6 address to avoid mixing it up with the :port, and use the
ever-so-handy %pI[46] formats.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 16:49:53 -07:00
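A rough userspace analogue of the printing fix in d06dbaf6c2 above, assuming POSIX inet_ntop() in place of the kernel's %pI4/%pI6 format specifiers. The function name format_addr() is hypothetical; the point is the adequately sized buffer, the bounded snprintf(), and the brackets around the IPv6 address.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical userspace analogue: format addr:port into buf, wrapping
 * IPv6 addresses in brackets.  Returns buf, or NULL on failure.
 */
const char *format_addr(const struct sockaddr_storage *ss, char *buf, size_t len)
{
	char ip[INET6_ADDRSTRLEN];	/* large enough for either family */
	int n;

	assert(ss && buf);
	if (ss->ss_family == AF_INET) {
		const struct sockaddr_in *in4 = (const struct sockaddr_in *)ss;

		if (!inet_ntop(AF_INET, &in4->sin_addr, ip, sizeof(ip)))
			return NULL;
		n = snprintf(buf, len, "%s:%d", ip, (int)ntohs(in4->sin_port));
	} else if (ss->ss_family == AF_INET6) {
		const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *)ss;

		if (!inet_ntop(AF_INET6, &in6->sin6_addr, ip, sizeof(ip)))
			return NULL;
		n = snprintf(buf, len, "[%s]:%d", ip, (int)ntohs(in6->sin6_port));
	} else {
		return NULL;
	}
	/* snprintf() reports the untruncated length; treat truncation as failure. */
	if (n < 0 || (size_t)n >= len)
		return NULL;
	return buf;
}
```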
Dan Carpenter b0bbb0be8f ceph: add kfree() to error path
We leak a "pi" on this error path.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 08:03:24 -07:00
Sage Weil 22b1de06c9 ceph: fix leak of mon authorizer
Fix leak of a struct ceph_buffer on umount.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 15:36:49 -07:00
Sage Weil ed98adad3d ceph: fix message revocation
A message can be on a queue (pending or sent), or out_msg (sending), or
both.  We were assuming that if it's not on a queue it couldn't be out_msg,
but that was false in the case of lossy connections like the OSD.  Fix
ceph_con_revoke() to treat these cases independently.  Also, fix the
out_kvec_is_message check to only trigger if we are currently sending
_this_ message.

This fixes a GPF in tcp_sendpage, triggered by OSD restarts.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 12:16:23 -07:00
Sage Weil 153a10939e ceph: fix crush device 'out' threshold to 1.0, not 0.1
Fix a typo that made any OSD weighted between 0.1 and 1.0 effectively
weighted as 1.0 (fully in).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 09:44:17 -07:00
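To illustrate why the threshold in 153a10939e above matters, here is a small sketch of an "is this device out?" test. It assumes weights are 16.16 fixed-point values (so 1.0 is 0x10000) and that the caller supplies a 16-bit pseudo-random draw; the names WEIGHT_ONE and device_is_out() are made up for the example and are not the CRUSH source.

```c
#include <assert.h>
#include <stdint.h>

#define WEIGHT_ONE 0x10000u	/* assumed 16.16 fixed-point encoding of 1.0 */

/*
 * Illustrative only: decide whether a device is "out" for one placement.
 * draw is a 16-bit pseudo-random value derived from the input and the item.
 */
int device_is_out(uint32_t weight, uint32_t draw)
{
	assert(draw <= 0xffffu);
	if (weight >= WEIGHT_ONE)
		return 0;	/* fully in */
	if (weight == 0)
		return 1;	/* fully out */
	/* Partially weighted: in for roughly weight/WEIGHT_ONE of all draws. */
	return draw < weight ? 0 : 1;
}
```

If the constant in the first comparison is too small, partially weighted devices take the "fully in" branch and never reach the proportional test, which matches the behaviour the commit message describes.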
Sage Weil 443b3760a0 ceph: fix caps usage accounting for import (non-reserved) case
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:56 -07:00
Sage Weil ec97f88ba6 ceph: only release clean, unused caps with mds requests
We can drop caps with an mds request.  Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it.  If caps are dirty, and the MDS needs them back, it
will revoke them and we will flush in the normal fashion.

This fixes a possible loss of metadata.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:55 -07:00
Sage Weil a1a31e7342 ceph: fix crush CHOOSE_LEAF when type is already a leaf
We may not recurse for CHOOSE_LEAF if we start with a leaf node.  When
that happens, the out2 vector needs to be filled in with the result.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:58:14 -07:00
Sage Weil 55bda7aacd ceph: fix crush recursion
There was a longstanding problem with recursion through intervening
bucket types on complex hierarchies.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:55:48 -07:00
Yehuda Sadeh bfaf148eb2 ceph: fix caps debugfs entry
The ceph client structure was not set correctly.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 09:47:36 -07:00
Sage Weil 17c688c3df ceph: delay umount until all mds requests drop inode+dentry refs
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structure and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache.  If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.

This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct.  I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:11:50 -07:00
Sage Weil d69ed05a80 ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing.  (Also, report the error code.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:04:10 -07:00
Sage Weil cebc5be6b6 ceph: fix crush map update decoding
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-17 10:22:48 -07:00
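The decoding fix in cebc5be6b6 above comes down to advancing the read position past a nested, length-prefixed blob before reading what follows. A minimal sketch, assuming a hypothetical layout of [u32 length][blob][u32 trailing field] in little-endian byte order; decode_u32() and decode_update() are illustrative names, not the kernel's decoders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian u32 at *p and advance *p past it. */
uint32_t decode_u32(const uint8_t **p)
{
	const uint8_t *b = *p;

	*p += 4;
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/*
 * Hypothetical layout: [u32 len][len bytes of nested map][u32 trailing field].
 * Returns 0 and stores the trailing field, or -1 if the buffer is too short.
 */
int decode_update(const uint8_t *p, const uint8_t *end, uint32_t *trailing)
{
	uint32_t len;

	assert(p && end && trailing);
	if ((size_t)(end - p) < 4)
		return -1;
	len = decode_u32(&p);
	if ((size_t)(end - p) < len)
		return -1;
	/* ... the nested blob at [p, p + len) would be decoded here ... */
	p += len;	/* skip past the nested blob before reading on */
	if ((size_t)(end - p) < 4)
		return -1;
	*trailing = decode_u32(&p);
	return 0;
}
```

Leaving out the `p += len` step would make the trailing field be read from the middle of the nested blob.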
Sage Weil ae32be3134 ceph: fix message memory leak, uninitialized variable
We need to properly initialize skip, as not all alloc_msg op instances
set it.

Also, BUG if someone says skip but also allocates a message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Sage Weil 4a32f93d29 ceph: fix map handler error path
Don't leak message if we receive an unexpected message type.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Yehuda Sadeh 0cf5537b15 ceph: some endianity fixes
Fix some problems that came up with sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Sage Weil 2b2300d62e ceph: try to send partial cap release on cap message on missing inode
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately.  This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.

If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:25 -07:00
Sage Weil 3d7ded4d81 ceph: release cap on import if we don't have the inode
If we get an IMPORT that gives us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:07 -07:00
Sage Weil 9dbd412f56 ceph: fix misleading/incorrect debug message
Nothing is released here: the caps message is simply ignored in this case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:29:59 -07:00
Jeff Mahoney 00d5643e7c ceph: fix atomic64_t initialization on ia64
bdi_seq is an atomic_long_t but we're using ATOMIC_INIT, which causes
build failures on ia64. This patch fixes it to use ATOMIC_LONG_INIT.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:29:50 -07:00
Sage Weil 1e5ea23df1 ceph: fix lease revocation when seq doesn't match
If the client revokes a lease with a higher seq than what we have, keep
the mds's seq, so that it honors our release.  Otherwise, we can hang
indefinitely.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-04 10:05:40 -07:00
Sage Weil 558d3499bd ceph: fix f_namelen reported by statfs
We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX.
That disagrees with ceph_lookup behavior (which checks against NAME_MAX),
and also makes the pjd posix test suite spit out ugly errors because it
can't clean up its temporary files.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:56:03 -07:00
Yehuda Sadeh 205475679a ceph: fix memory leak in statfs
Freeing the statfs request structure when required.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:56:02 -07:00
Henry C Chang 13a4214cd9 ceph: fix d_subdirs ordering problem
We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.

This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur the first time you run autogen.sh.)

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-01 16:55:55 -07:00
Linus Torvalds b612a05537 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: clean up on forwarded aborted mds request
  ceph: fix leak of osd authorizer
  ceph: close out mds, osd connections before stopping auth
  ceph: make lease code DN specific
  fs/ceph: Use ERR_CAST
  ceph: renew auth tickets before they expire
  ceph: do not resend mon requests on auth ticket renewal
  ceph: removed duplicated #includes
  ceph: avoid possible null dereference
  ceph: make mds requests killable, not interruptible
  sched: add wait_for_completion_killable_timeout
2010-05-30 08:56:39 -07:00
Sage Weil 2a8e5e3637 ceph: clean up on forwarded aborted mds request
If an mds request is aborted (timeout, SIGKILL), it is left registered to
keep our state in sync with the mds.  If we get a forward notification,
though, we know the request didn't succeed and we can unregister it
safely.  We were trying to resend it, but then bailing out (and not
unregistering) in __do_request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:05 -07:00
Sage Weil 79494d1b9b ceph: fix leak of osd authorizer
Release the ceph_authorizer when releasing osd state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:04 -07:00
Sage Weil a922d38fd1 ceph: close out mds, osd connections before stopping auth
The auth module (part of the mon_client) is needed to free any
ceph_authorizer(s) used by the mds and osd connections.  Flush the msgr
workqueue before stopping monc to ensure that the destroy_authorizer
auth op is available when those connections are closed out.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:42:03 -07:00
Sage Weil dd1c905736 ceph: make lease code DN specific
The lease code includes a mask in the CEPH_LOCK_* namespace, but that
namespace is changing, and only one mask (formerly _DN == 1) is used, so
hard code for that value for now.

If we ever extend this code to handle leases over different data types we
can extend it accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:42 -07:00
Julia Lawall 7e34bc524e fs/ceph: Use ERR_CAST
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes the
purpose of the operation clearer; the latter otherwise looks like a
no-op.

In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:41 -07:00
Sage Weil a41359fa35 ceph: renew auth tickets before they expire
We were only requesting renewal after our tickets expire; do so before
that.  Most of the low-level logic for this was already there; just use
it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:39 -07:00
Sage Weil 09c4d6a7d4 ceph: do not resend mon requests on auth ticket renewal
We only want to send pending mon requests when we successfully
authenticate.  If we are already authenticated, like when we renew our
ticket, there is no need to resend pending requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:38 -07:00
Andrea Gelmini 984c76908e ceph: removed duplicated #includes
fs/ceph/auth.c: linux/slab.h is included more than once.
fs/ceph/super.h: linux/slab.h is included more than once.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:37 -07:00
Sage Weil e95e9a7ae4 ceph: avoid possible null dereference
ac->ops may be null; use protocol id in error message instead.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:36 -07:00
Sage Weil aa91647c89 ceph: make mds requests killable, not interruptible
The underlying problem is that many mds requests can't be restarted.  For
example, a restarted create() would return -EEXIST if the original request
succeeds.  However, we do not want a hung MDS to hang the client too.  So,
use the _killable wait_for_completion variants to abort on SIGKILL but
nothing else.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-29 09:12:35 -07:00
Christoph Hellwig 7ea8085910 drop unused dentry argument to ->fsync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:05:02 -04:00
Linus Torvalds 6e188240eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
  ceph: reuse mon subscribe message instead of allocated anew
  ceph: avoid resending queued message to monitor
  ceph: Storage class should be before const qualifier
  ceph: all allocation functions should get gfp_mask
  ceph: specify max_bytes on readdir replies
  ceph: cleanup pool op strings
  ceph: Use kzalloc
  ceph: use common helper for aborted dir request invalidation
  ceph: cope with out of order (unsafe after safe) mds reply
  ceph: save peer feature bits in connection structure
  ceph: resync headers with userland
  ceph: use ceph. prefix for virtual xattrs
  ceph: throw out dirty caps metadata, data on session teardown
  ceph: attempt mds reconnect if mds closes our session
  ceph: clean up send_mds_reconnect interface
  ceph: wait for mds OPEN reply to indicate reconnect success
  ceph: only send cap releases when mds is OPEN|HUNG
  ceph: dicard cap releases on mds restart
  ceph: make mon client statfs handling more generic
  ceph: drop src address(es) from message header [new protocol feature]
  ...
2010-05-24 07:37:52 -07:00
Sage Weil 240ed68eb5 ceph: reuse mon subscribe message instead of allocated anew
Use the same message, allocated during startup.  No need to reallocate a
new one each time around (and potentially ENOMEM).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 16:26:11 -07:00
Christoph Hellwig 8018ab0574 sanitize vfs_fsync calling conventions
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsync_range.

The next step will be removing the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:21 -04:00
Al Viro 3981f2e2a0 ceph: should use deactivate_locked_super() on failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:13 -04:00
Sage Weil 970690012c ceph: avoid resending queued message to monitor
The auth_reply handler will (re)send any pending requests.  For the
initial mon authenticate phase, that's correct, but when an auth ticket
renewal races with an in-flight request, we may resend a request message
that is already in flight.  Avoid this by revoking the message before
sending it.

We should also avoid resending requests at all during ticket renewal; that
will come soon.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 15:01:22 -07:00
Tobias Klauser 9e32789f63 ceph: Storage class should be before const qualifier
The C99 specification states in section 6.11.5:

The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-21 15:01:21 -07:00
Yehuda Sadeh 34d23762d9 ceph: all allocation functions should get gfp_mask
This is essential because the rados block device will need
to run in different contexts that require flags
other than GFP_NOFS.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:42 -07:00
Sage Weil 23804d91f1 ceph: specify max_bytes on readdir replies
Specify max bytes in request to bound size of reply.  Add associated
mount option with default value of 512 KB.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:41 -07:00
Sage Weil 366837706b ceph: cleanup pool op strings
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:41 -07:00
Julia Lawall cffe7b6d8c ceph: Use kzalloc
Use kzalloc rather than the combination of kmalloc and memset.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x,size,flags;
statement S;
@@

-x = kmalloc(size,flags);
+x = kzalloc(size,flags);
 if (x == NULL) S
-memset(x, 0, size);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:40 -07:00
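The transformation in cffe7b6d8c above has a direct userspace counterpart, shown here with malloc()/memset() and calloc() standing in for kmalloc()/memset() and kzalloc(). The struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct session { int id; char name[16]; };

/* Before: allocate, check, then zero by hand. */
struct session *session_alloc_old(void)
{
	struct session *s = malloc(sizeof(*s));

	if (s == NULL)
		return NULL;
	memset(s, 0, sizeof(*s));
	return s;
}

/* After: one call that returns zeroed memory (calloc stands in for kzalloc). */
struct session *session_alloc_new(void)
{
	return calloc(1, sizeof(struct session));
}

/* Helper for checking that an object is all-zero. */
int session_is_zeroed(const struct session *s)
{
	static const struct session zero;

	assert(s != NULL);
	return memcmp(s, &zero, sizeof(zero)) == 0;
}
```

Both variants return the same all-zero object; the second removes the chance of zeroing with a mismatched size.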
Sage Weil 167c9e352d ceph: use common helper for aborted dir request invalidation
We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay.  Use common helper to avoid duplicate code.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:40 -07:00
Sage Weil 85792d0dd6 ceph: cope with out of order (unsafe after safe) mds reply
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:39 -07:00
Sage Weil aba558e28a ceph: save peer feature bits in connection structure
These are used for adjusting behavior, such as conditionally encoding a
newer message format.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:38 -07:00
Sage Weil ca9d93a292 ceph: resync headers with userland
Notable changes include pool op defines and types, FLOCK feature bit, and
new CMPXATTR osd ops.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:38 -07:00
Sage Weil 1a75627896 ceph: use ceph. prefix for virtual xattrs
Drop the 'user.' prefix and use just 'ceph.' for fs virtual xattrs.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:37 -07:00
Sage Weil 6c99f2545d ceph: throw out dirty caps metadata, data on session teardown
The remove_session_caps() helper is called when an MDS closes out our
session (either normally, or as a result of a failed reconnect), and when
we tear down state for umount.  If we remove the last cap, and there are
no cap migrations in progress, then there is little hope of us flushing
out that data to the mds (without heroic efforts to reconnect and flush).

So, to avoid leaving inodes pinned (due to dirty state) and crashing after
umount, throw out dirty caps state and unpin the inodes.  Print a warning
to the console so we know something was lost.

NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean;
maybe a truncate should be queued?

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:37 -07:00
Sage Weil 7e70f0ed9f ceph: attempt mds reconnect if mds closes our session
Currently, if our session is closed (due to a timeout, or explicit close,
or whatever), we just sit there doing nothing unless/until the MDS
restarts, at which point we try to reconnect.

Change client to attempt an immediate reconnect if our session is closed.

Note that currently the MDS doesn't support this, and our attempt will
fail.  We'll get a session CLOSE, our caps and dirty cap state will be
dropped, and the client will be free to attempt to reconnect.  That's
clearly not as nice as a successful reconnect, but it at least allows us
to try to carry on, and in the future the MDS will support a reconnect
and we will fare better.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:36 -07:00
Sage Weil 34b6c855fa ceph: clean up send_mds_reconnect interface
Pass a ceph_mds_session, since the caller has it.

Remove the dead code for sending empty reconnects.  It used to be used
when the MDS contacted _us_ to solicit a reconnect, and we could reply
saying "go away, I have no session."  Now we only send reconnects based
on the mds map, and only when we do in fact have an open session.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:35 -07:00
Sage Weil 29790f26ab ceph: wait for mds OPEN reply to indicate reconnect success
We used to infer reconnect success by watching the MDS state, essentially
assuming that hearing nothing meant things were ok.  That wasn't
particularly reliable.  Instead, the MDS replies with an explicit OPEN
message to indicate success.

Strictly speaking, this is a protocol change, but it is a backwards
compatible one that does not break new clients + old servers or old
clients + new servers.  At least not yet.

Drop unused @all argument from kick_requests while we're at it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:35 -07:00
Sage Weil aab53dd9e8 ceph: only send cap releases when mds is OPEN|HUNG
On OPENING we shouldn't have any caps (or releases).
On CLOSING, we should wait until we succeed (and throw it all out), or
don't (and are OPEN again).
On RECONNECTING we can wait until we are OPEN.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:34 -07:00
Sage Weil e01a594646 ceph: dicard cap releases on mds restart
If the MDS restarts, the expire caps state is no longer shared, and can be
thrown out.  Caps state will be rebuilt on the MDS during the reconnect
process that follows.  Zero out any release messages and adjust the
release counter accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:33 -07:00
Yehuda Sadeh f8c76f6f25 ceph: make mon client statfs handling more generic
This is being done so that we can reuse the statfs
infrastructure with other requests that return values.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:33 -07:00
Sage Weil dbad185d49 ceph: drop src address(es) from message header [new protocol feature]
The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source
address in each message header (twice).  This patch switches the client to
the new scheme, and _requires_ this feature on the server.  The server
will support both the old and new schemes.  That means an old client will
work with a new server, but a new client will not work with an old server.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:32 -07:00
Dan Carpenter a5ee751c15 ceph: cleanup: remove unused assignement
We don't ever use "dirty" so we can remove it.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:32 -07:00
Sage Weil 0f8605f2bd ceph: clean up cap release loop vs spinlock
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:31 -07:00
Sage Weil 31e0cf8f6a ceph: name bdi ceph-%d instead of major:minor
The bdi_setup_and_register() helper doesn't help us since we bdi_init() in
create_client() and bdi_register() only when sget() succeeds.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:30 -07:00
Sage Weil 56b7cf9581 ceph: skip mds sync on forced unmount
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:30 -07:00
Sage Weil b736b3d9d0 ceph: adjust masked struct_v variable names
Reported-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:29 -07:00
Sage Weil 6e19a16ef2 ceph: clean up mount options, ->show_options()
Ensure all options are included in /proc/mounts.  Some cleanup.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:29 -07:00
Sage Weil 1cd3935bed ceph: set dn offset when spliced
We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry().  Notably, we should NOT assign an
offset when a dentry is first created and is still null.

BUG if we try to splice a non-null dentry (we shouldn't).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:28 -07:00
Sage Weil 1b7facc41b ceph: don't clobber i_max_offset on already complete dir
This can screw up offsets assigned to new dentries and break dcache
readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:27 -07:00
Sage Weil e8a7498715 ceph: skip set_dentry_offset work if directory not I_COMPLETE
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:27 -07:00
Sage Weil f1f2765fae ceph: set next_offset on readdir finish
Set next_offset to 2 (always 2!), not 0, on readdir finish.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:26 -07:00
Henry C Chang bddfa3cc18 ceph: listxattr should compare version by >=
If the version hasn't changed, don't rebuild the index.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:26 -07:00
Sage Weil a6424e48c8 ceph: fix xattr dangling pointer / double free
If we use the xattr_blob, clear the pointer so we don't release the memory
at the bottom of the function.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:25 -07:00
Sage Weil 9dd4658db1 ceph: close messenger race
Simplify messenger locking, and close race between ceph_con_close() setting
the CLOSED bit and con_work() checking the bit, then taking the mutex.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:25 -07:00
Sage Weil 4f48280ee1 ceph: name msgpools; useful error messages
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:24 -07:00
Sage Weil 8c6efb58a5 ceph: fix memory leak due to possible dentry init race
Free dentry_info in error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:23 -07:00
Sage Weil 559c1e0073 ceph: include auth method in error messages
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:23 -07:00
Sage Weil f26e681d52 ceph: osdtimeout=0 for no timeout
Allow the osd reset timeout to be disabled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:22 -07:00
Dan Carpenter 0d509c949a ceph: d_obtain_alias() returns ERR_PTR()
d_obtain_alias() doesn't return NULL, it returns an ERR_PTR().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:22 -07:00
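The NULL-versus-ERR_PTR() distinction fixed in 0d509c949a above can be illustrated with a minimal userspace sketch. The `sketch_` helpers and `SKETCH_MAX_ERRNO` are hypothetical stand-ins that only mimic the encoding used by the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); they are not the kernel definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: error codes occupy the top 4095 values of the address space. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the reserved error range. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

An encoded error is never NULL, so a caller that tests only `if (!ptr)` after such a call misses every failure; it has to test with the IS_ERR()-style check instead.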
Yehuda Sadeh c473ad927e ceph: wake up mount thread when getting osdmap
Now that the mount thread waits for the osdmap, it needs
to be awakened.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-05-17 15:25:21 -07:00
Huang Weiyi 1bb71637d0 ceph: remove unused #includes
Remove unused #include's in
  fs/ceph/super.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:21 -07:00
Sage Weil 6822d00b54 ceph: wait for both monmap and osdmap when opening session
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-05-17 15:25:20 -07:00
Sage Weil 6f2bc3ff4c ceph: clean up connection reset
Reset out_keepalive_pending and peer_global_seq, and drop unused var.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:20 -07:00
Sage Weil bb257664f7 ceph: simplify ceph_msg_new
We only need to pass in front_len.  Callers can attach any other payload
pieces (middle, data) as they see fit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:19 -07:00
Sage Weil a79832f26b ceph: make ceph_msg_new return NULL on failure; clean up, fix callers
Returning ERR_PTR(-ENOMEM) is useless extra work.  Return NULL on failure
instead, and fix up the callers (about half of which were wrong anyway).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:18 -07:00
Sage Weil d52f847a84 ceph: rewrite msgpool using mempool_t
Since we don't need to maintain large pools of messages, we can just
use the standard mempool_t.  We maintain a msgpool 'wrapper' because we
need the mempool_t* in the alloc function, and mempool gives us only
pool_data.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:18 -07:00
Cheng Renquan 640ef79d27 ceph: use ceph_sb_to_client instead of ceph_client
ceph_sb_to_client and ceph_client are identical, so one of them should go.
The function name ceph_client is easily confused with "struct ceph_client",
and ceph_sb_to_client's name is clearer, so switch all callers to
ceph_sb_to_client.

  -static inline struct ceph_client *ceph_client(struct super_block *sb)
  -{
  -	return sb->s_fs_info;
  -}

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:17 -07:00
Cheng Renquan 2d06eeb877 ceph: handle kzalloc() failure
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:16 -07:00
Sage Weil 7c315c552c ceph: drop unnecessary msgpool for mon_client subscribe_ack
Preallocate a single message to reuse instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:16 -07:00
Sage Weil 6694d6b95c ceph: drop unnecessary msgpool for mon_client auth_reply
Preallocate a single reply message that we can reuse instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:15 -07:00
Sage Weil 3143edd3a1 ceph: clean up statfs
Avoid unnecessary msgpool.  Preallocate reply.  Fix use-after-free race.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:15 -07:00
Sage Weil 6f46cb2935 ceph: fix theoretically possible double-put on connection
This would only trigger if we bailed out before resetting r_con_filling_msg
because the server reply was corrupt (oversized).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:14 -07:00
Dan Carpenter c7708075f1 ceph: cleanup: remove dead code
"xattr" is never NULL here.  We took care of that in the previous
if statement block.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:14 -07:00
Sage Weil 104648ad3f ceph: reduce build_path debug output
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:13 -07:00
Yehuda Sadeh 31459fe4b2 ceph: use __page_cache_alloc and add_to_page_cache_lru
Following Nick Piggin's patches in btrfs, pagecache pages should be
allocated with __page_cache_alloc, so they obey pagecache memory
policies.

Also, use add_to_page_cache_lru instead of a private
pagevec where applicable.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:12 -07:00
Stephen Rothwell f553069e5d ceph: update for removal of kref_set
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:12 -07:00
Sage Weil 21b667f69b ceph: simplify page setup for incoming data
Drop largely useless helper __prepare_pages(), and simplify sanity checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:11 -07:00
Sage Weil 81a6cf2d30 ceph: invalidate affected dentry leases on aborted requests
If we abort a request, we return to caller, but the request may still
complete.  And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request.  A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').

Ensure we invalidate affected dentry leases (as well as dir I_COMPLETE) so
we don't have valid (but incorrect) leases.  Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:45 -07:00
Sage Weil b4556396fa ceph: fix race between aborted requests and fill_trace
When we abort requests we need to prevent fill_trace et al from doing
anything that relies on locks held by the VFS caller.  This fixes a race
between the reply handler and the abort code, ensuring that we continue
holding the dir mutex until the reply handler completes.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:45 -07:00
Sage Weil e1518c7c0a ceph: clean up mds reply, error handling
We would occasionally BUG out in the reply handler because r_reply was
nonzero, due to a race with ceph_mdsc_do_request temporarily setting
r_reply to an ERR_PTR value.  This is unnecessary, messy, and also wrong
in the EIO case.

Clean up by consistently using r_err for errors and r_reply for messages.
Also fix the abort logic to trigger consistently for all errors that return
to the caller early (e.g., EIO from timeout case).  If an abort races with
a reply, use the result from the reply.

Also fix locking for r_err, r_reply update in the reply handler.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:44 -07:00
Sage Weil e84346b726 ceph: preserve seq # on requeued messages after transient transport errors
If the tcp connection drops and we reconnect to reestablish a stateful
session (with the mds), we need to resend previously sent (and possibly
received) messages with the _same_ seq # so that they can be dropped on
the other end if needed.  Only assign a new seq once after the message is
queued.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 21:20:38 -07:00
Sage Weil f818a73674 ceph: fix cap removal races
The iterate_session_caps helper traverses the session caps list and tries
to grab an inode reference.  However, the __ceph_remove_cap was clearing
the inode backpointer _before_ removing itself from the session list,
causing a null pointer dereference.

Clear cap->ci under protection of s_cap_lock to avoid the race, and to
tightly couple the list and backpointer state.  Use a local flag to
indicate whether we are releasing the cap, as cap->session may be modified
by a racing thread in iterate_session_caps.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 20:56:31 -07:00
Sage Weil 45c6ceb547 ceph: zero unused message header, footer fields
We shouldn't leak any prior memory contents to other parties.  And random
data, particularly in the 'version' field, can cause problems down the
line.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 15:17:40 -07:00
Sage Weil 9abf82b8bc ceph: fix locking for waking session requests after reconnect
The session->s_waiting list is protected by mdsc->mutex, not s_mutex.  This
was causing (rare) s_waiting list corruption.

Fix error paths too, while we're here.  A more thorough cleanup of this
function is coming soon.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:57 -07:00
Sage Weil d85b705663 ceph: resubmit requests on pg mapping change (not just primary change)
OSD requests need to be resubmitted on any pg mapping change, not just when
the pg primary changes.  Resending only when the primary changes results in
occasional 'hung' requests during osd cluster recovery or rebalancing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:56 -07:00
Sage Weil 04d000eb35 ceph: fix open file counting on snapped inodes when mds returns no caps
It's possible the MDS will not issue caps on a snapped inode, in which case
an open request may not __ceph_get_fmode(), botching the open file
counting.  (This is actually a server bug, but the client shouldn't BUG out
in this case.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:55 -07:00
Sage Weil 0ceed5db32 ceph: unregister osd request on failure
The osd request wasn't being unregistered when the osd returned a failure
code, even though the result was returned to the caller.  This would cause
it to eventually time out, and then crash the kernel when it tried to
resend the request using a stale page vector.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-11 09:53:18 -07:00
Sage Weil 54ad023ba8 ceph: don't use writeback_control in writepages completion
The ->writepages writeback_control is not still valid in the writepages
completion.  We were touching it solely to adjust pages_skipped when there
was a writeback error (EIO, ENOSPC, EPERM due to bad osd credentials),
causing an oops in the writeback code shortly thereafter.  Updating
pages_skipped on error isn't correct anyway, so let's just rip out this
(clearly broken) code to pass the wbc to the completion.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-05 21:31:40 -07:00
Sage Weil 5dfc589a84 ceph: unregister bdi before kill_anon_super releases device name
Unregister and destroy the bdi in put_super, after mount is r/o, but before
put_anon_super releases the device name.

For symmetry, bdi_destroy in destroy_client (we bdi_init in create_client).

Only set s_bdi if bdi_register succeeds, since we use it to decide whether
to bdi_unregister.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-04 16:14:46 -07:00
Sage Weil b0930f8d38 ceph: remove bad auth_x kmem_cache
It's useless, since our allocations are already a power of 2.  And it was
allocated per-instance (not globally), which caused a name collision when
we tried to mount a second file system with auth_x enabled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil 7ff899da02 ceph: fix lockless caps check
The __ variant requires caller to hold i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil ea1409f961 ceph: clear dir complete, invalidate dentry on replayed rename
If a rename operation is resent to the MDS following an MDS restart, the
client does not get a full reply (containing the resulting metadata) back.
In that case, a ceph_rename() needs to compensate by doing anything useful
that fill_inode() would have, like d_move().

It also needs to invalidate the dentry (to work around the vfs_rename_dir()
bug) and clear the dir complete flag, just like fill_trace().

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
Sage Weil 5c6a2cdb4f ceph: fix direct io truncate offset
truncate_inode_pages_range wants the end offset to align with the last byte
in a page.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:25 -07:00
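The alignment rule behind 5c6a2cdb4f above (the end offset passed to truncate_inode_pages_range must be the last byte of a page) can be sketched as plain arithmetic. This is a userspace illustration only: `SKETCH_PAGE_SIZE` and `sketch_last_byte_of_page()` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel takes the real value from its own macros. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Return the offset of the last byte of the page containing `pos`.
 * Setting all low-order bits rounds up to the next page boundary minus one.
 */
static uint64_t sketch_last_byte_of_page(uint64_t pos)
{
	return pos | (SKETCH_PAGE_SIZE - 1);
}
```

For a direct I/O write ending at byte 5000, the sketch yields 8191, the last byte of the second page, rather than the unaligned 5000.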
Sage Weil ae18756b9f ceph: discard incoming messages with bad seq #
We can get old message seq #'s after a tcp reconnect for stateful sessions
(i.e., the MDS).  If we get a higher seq #, that is an error, and we
shouldn't see any bad seq #'s for stateless (mon, osd) connections.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
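The seq # handling described in ae18756b9f above can be sketched as a three-way classification. This is an assumption-laden illustration, not the messenger code: the enum, the function name and the exact comparison are hypothetical, chosen to match the commit text (old seq #'s are discarded, a higher-than-expected seq # is an error).

```c
#include <stdint.h>

enum sketch_seq_verdict {
	SKETCH_SEQ_ACCEPT,      /* exactly the next expected message */
	SKETCH_SEQ_DISCARD_OLD, /* already seen, e.g. resent after a reconnect */
	SKETCH_SEQ_ERROR_GAP    /* higher than expected: messages are missing */
};

/* `in_seq` is the seq # of the last message accepted on this connection. */
static enum sketch_seq_verdict sketch_check_seq(uint64_t in_seq, uint64_t msg_seq)
{
	/* Signed difference so that older seq #'s come out negative. */
	int64_t delta = (int64_t)(msg_seq - in_seq);

	if (delta < 1)
		return SKETCH_SEQ_DISCARD_OLD;
	if (delta > 1)
		return SKETCH_SEQ_ERROR_GAP;
	return SKETCH_SEQ_ACCEPT;
}
```

With `in_seq` at 5, a message carrying seq 6 is accepted, seq 5 or 3 is dropped as a duplicate, and seq 8 is reported as an error.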
Sage Weil 684be25c52 ceph: fix seq counting for skipped messages
Increment in_seq even when the message is skipped for some reason.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
Sage Weil d45d0d970f ceph: add missing #includes
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:24 -07:00
Sage Weil 0b0c06d147 ceph: fix leaked spinlock during mds reconnect
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil c8f16584ac ceph: print more useful version info on module load
Decouple the client version from the server side.  Print relevant protocol
and map version info instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil 91dee39eeb ceph: fix snap realm splits
The snap realm split was checking i_snap_realm, not the list_head, to
determine if an inode belonged in the new realm.  The check always failed,
which meant we always moved the inode, corrupting the old realm's list and
causing various crashes.

Also wait to release old realm reference to avoid possibility of use after
free.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:23 -07:00
Sage Weil c10f5e12ba ceph: clear dir complete on d_move
d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 10:49:22 -07:00
Linus Torvalds 96e35b40c0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: use separate class for ceph sockets' sk_lock
  ceph: reserve one more caps space when doing readdir
  ceph: queue_cap_snap should always queue dirty context
  ceph: fix dentry reference leak in dcache readdir
  ceph: decode v5 of osdmap (pool names) [protocol change]
  ceph: fix ack counter reset on connection reset
  ceph: fix leaked inode ref due to snap metadata writeback race
  ceph: fix snap context reference leaks
  ceph: allow writeback of snapped pages older than 'oldest' snapc
  ceph: fix dentry rehashing on virtual .snap dir
2010-04-14 18:45:31 -07:00
Sage Weil a6a5349d17 ceph: use separate class for ceph sockets' sk_lock
Use a separate class for ceph sockets to prevent lockdep confusion.
Because ceph sockets only get passed kernel pointers, there is no
dependency from sk_lock -> mmap_sem.  If we share the same class as other
sockets, lockdep detects a circular dependency from

	mmap_sem (page fault) -> fs mutex -> sk_lock -> mmap_sem

because dependencies are noted from both ceph and user contexts.  Using
a separate class prevents the sk_lock(ceph) -> mmap_sem dependency and
makes lockdep happy.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 14:07:07 -07:00
Yehuda Sadeh e1e4dd0caa ceph: reserve one more caps space when doing readdir
We were missing space for the directory cap.  The result was a BUG at
fs/ceph/caps.c:2178.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 12:28:54 -07:00
Sage Weil fc837c8f04 ceph: queue_cap_snap should always queue dirty context
This simplifies the calling convention, and fixes a bug where we queue a
capsnap with a context other than i_head_snapc (the one that matches the
dirty pages).  The result was a BUG at fs/ceph/caps.c:2178 on writeback
completion when a capsnap matching the writeback snapc could not be found.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-13 12:28:31 -07:00
Sage Weil f5b066287c ceph: fix dentry reference leak in dcache readdir
When filldir returned an error (e.g. buffer full for a large directory),
we would leak a dentry reference, causing an oops on umount.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-12 14:25:51 -07:00
Sage Weil 2844a76a25 ceph: decode v5 of osdmap (pool names) [protocol change]
Teach the client to decode an updated format for the osdmap.  The new
format includes pool names, which will be useful shortly.  Get this change
in earlier rather than later.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-09 15:50:58 -07:00
Sage Weil 0e0d5e0c4b ceph: fix ack counter reset on connection reset
If in_seq_acked isn't reset along with in_seq, we don't ack received
messages until we reach the old count, consuming gobs of memory on the other
end of the connection and introducing a large delay when those messages
are eventually deleted.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-02 16:07:19 -07:00
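The failure mode in 0e0d5e0c4b above can be reproduced in a small userspace model. The struct and helper names are hypothetical, and the model assumes that an ack is sent only while `in_seq` is greater than `in_seq_acked`, which is what the commit text implies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the per-connection receive counters. */
struct sketch_con {
	uint64_t in_seq;       /* seq # of the last message received */
	uint64_t in_seq_acked; /* seq # of the last message we acked */
};

/* Assumed ack rule: an ack is due once we have received past the last ack. */
static bool sketch_ack_pending(const struct sketch_con *con)
{
	return con->in_seq > con->in_seq_acked;
}

/* Account for one newly received message. */
static void sketch_receive(struct sketch_con *con)
{
	con->in_seq++;
}

/* Reset as in the fix: both counters go back to zero together. */
static void sketch_con_reset(struct sketch_con *con)
{
	con->in_seq = 0;
	con->in_seq_acked = 0;
}
```

If only `in_seq` is zeroed on reset while `in_seq_acked` stays at, say, 100, the next 100 messages produce no ack, so the peer has to keep them queued; resetting both counters makes the first new message ackable again.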
Sage Weil 819ccbfa44 ceph: fix leaked inode ref due to snap metadata writeback race
We create a ceph_cap_snap if there is dirty cap metadata (for writeback to
mds) OR dirty pages (for writeback to osd).  It is thus possible that the
metadata has been written back to the MDS but the OSD data has not when
the cap_snap is created.  This results in a cap_snap with dirty(caps) == 0.
The problem is that cap writeback to the MDS isn't necessary, and a
FLUSHSNAP cap op gets no ack from the MDS.  This leaves the cap_snap
attached to the inode along with its inode reference.

Fix the problem by dropping the cap_snap if it becomes 'complete' (all
pages written out) and dirty(caps) == 0 in ceph_put_wrbuffer_cap_refs().

Also, BUG() in __ceph_flush_snaps() if we encounter a cap_snap with
dirty(caps) == 0.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:38 -07:00
Sage Weil 6298a33757 ceph: fix snap context reference leaks
The get_oldest_context() helper takes a reference to the returned snap
context, but most callers weren't dropping that reference.  Fix them.

Also drop the unused locked __get_oldest_context() variant.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:37 -07:00
Sage Weil 80e755fede ceph: allow writeback of snapped pages older than 'oldest' snapc
On snap deletion, we don't regenerate ceph_cap_snaps for inodes with dirty
pages because deletion does not affect metadata writeback.  However, we
did run into problems when we went to write back the pages because the
'oldest' snapc is determined by the oldest cap_snap, and that may be the
newer snapc that reflects the deletion.  This caused confusion and an
infinite loop in ceph_update_writeable_page().

Change the snapc checks to allow writeback of any snapc that is equal to
OR older than the 'oldest' snapc.

When there are no cap_snaps, we were also using the realm's latest snapc
for writeback, which complicates ceph_put_wrbuffer_cap_refs().  Instead,
use i_head_snapc, the snapc used for the most recent ('head') data.
This makes the writeback snapc (ceph_osd_request.r_snapc) _always_ match a
capsnap or i_head_snapc.

Also, in writepages_finish(), drop the snapc referenced by the _page_
and do not assume it matches the request snapc (it may not anymore).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-04-01 09:34:36 -07:00
Sage Weil 9358c6d4c0 ceph: fix dentry rehashing on virtual .snap dir
If a lookup fails on the magic .snap directory, we bind it to a magic
snap directory inode in ceph_lookup_finish().  That code assumes the dentry
is unhashed, but a recent server-side change started returning NULL leases
on lookup failure, causing the .snap dentry to be hashed and NULL by
ceph_fill_trace().

This causes dentry hash chain corruption, or a crash when d_rehash()
includes
	BUG_ON(!d_unhashed(entry));

So, avoid processing the NULL dentry lease if the dentry matches the
snapdir name in ceph_fill_trace().  That allows the lookup completion to
properly bind it to the snapdir inode.  BUG there if dentry is hashed to
be sure.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-30 13:55:22 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and tries to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Sage Weil 94aa8ae13d ceph: fix use after free on mds __unregister_request
There was a use after free in __unregister_request that would trigger
whenever the request map held the last reference.  This appears to have
triggered an oops during 'umount -f' when requests are being torn down.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-28 21:23:56 -07:00
Sage Weil 393f662096 ceph: fix possible double-free of mds request reference
Clear pointer to mds request after dropping the reference to
ensure we don't drop it again, as there is at least one error
path through this function that does not reset fi->last_readdir
to a new value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:06 -07:00
Sage Weil d96d60498f ceph: fix session check on mds reply
Fix a broken check that a reply came back from the same MDS we sent the
request to.  I don't think a case that actually triggers this would ever
come up in practice, but it's clearly wrong and easy to fix.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:05 -07:00
Dan Carpenter 4736b009b8 ceph: handle kmalloc() failure
Return ERR_PTR(-ENOMEM) if kmalloc() fails.  We handle allocation
failures the same way later in the function.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil 9c423956b8 ceph: propagate mds session allocation failures to caller
Return error to original caller if register_session() fails.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil 8f883c24de ceph: make write_begin wait propagate ERESTARTSYS
Currently, if the wait_event_interruptible is interrupted, we
return EAGAIN unconditionally and loop, such that we aren't, in
fact, interruptible.  So, propagate ERESTARTSYS if we get it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:03 -07:00
Sage Weil ec4318bcb4 ceph: fix snap rebuild condition
We were rebuilding the snap context when it was not necessary
(i.e. when the realm seq hadn't changed _and_ the parent seq
was still older), which caused page snapc pointers to not match
the realm's snapc pointer (even though the snap context itself
was identical).  This confused begin_write and put it into an
endless loop.

The correct logic is: rebuild snapc if _my_ realm seq changed, or
if my parent realm's seq is newer than mine (and thus mine needs
to be rebuilt too).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:02 -07:00
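The corrected rule from ec4318bcb4 above (rebuild the snap context only if this realm's seq changed, or if the parent realm's context is newer) can be written as a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the seq values are passed in directly instead of being read from realm structures.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * cached_seq:        seq the realm's cached snap context was built for
 * realm_seq:         the realm's current seq
 * has_parent:        false for a realm without a parent
 * parent_cached_seq: seq of the parent's snap context (ignored if no parent)
 */
static bool sketch_need_rebuild(uint64_t cached_seq, uint64_t realm_seq,
				bool has_parent, uint64_t parent_cached_seq)
{
	/* My own realm changed since the context was built. */
	if (cached_seq != realm_seq)
		return true;
	/* The parent is newer than my context, so mine is stale too. */
	if (has_parent && parent_cached_seq > cached_seq)
		return true;
	/* Unchanged realm and an older (or equal) parent: keep the context. */
	return false;
}
```

Keeping the existing context in the last case matters because pages hold pointers to it; rebuilding an identical context yields a different pointer, which is the mismatch the commit describes.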
Sage Weil 87b315a5b5 ceph: avoid reopening osd connections when address hasn't changed
We get a fault callback on _every_ tcp connection fault.  Normally, we
want to reopen the connection when that happens.  If the address we have
is bad, however, and connection attempts always result in a connection
refused or similar error, explicitly closing and reopening the msgr
connection just prevents the messenger's backoff logic from kicking in.
The result can be a console full of

[ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed

Instead, if we get a fault, and have outstanding requests, but the osd
address hasn't changed and the connection never successfully connected in
the first place, do nothing to the osd connection.  The messenger layer
will back off and retry periodically, because we never connected and thus
the lossy bit is not set.

Instead, touch each request's r_stamp so that handle_timeout can tell the
request is still alive and kicking.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:01 -07:00
Sage Weil 3dd72fc0e6 ceph: rename r_sent_stamp r_stamp
Make variable name slightly more generic, since it will (soon)
reflect either the time the request was sent OR the time it was
last determined to be still retrying.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:00 -07:00
Sage Weil 3c3f2e32ef ceph: fix connection fault con_work reentrancy problem
The messenger fault was clearing the BUSY bit, for reasons unclear.  This
made it possible for the con->ops->fault function to reopen the connection,
and requeue work in the workqueue--even though the current thread was
already in con_work.

This avoids a problem where the client busy loops with connection failures
on an unreachable OSD, but doesn't address the root cause of that problem.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:59 -07:00
Sage Weil e4cb4cb8a0 ceph: prevent dup stale messages to console for restarting mds
Prevent duplicate 'mds0 caps stale' message from spamming the console every
few seconds while the MDS restarts.  Set s_renew_requested earlier, so that
we only print the message once, even if we don't send an actual request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:58 -07:00
Sage Weil efd7576b23 ceph: fix pg pool decoding from incremental osdmap update
The incremental map decoding of pg pool updates wasn't skipping
the snaps and removed_snaps vectors.  This caused osd requests
to stall when pool snapshots were created or fs snapshots were
deleted.  Use a common helper for full and incremental map
decoders that decodes pools properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:57 -07:00
Sage Weil 80fc7314a7 ceph: fix mds sync() race with completing requests
The wait_unsafe_requests() helper dropped the mdsc mutex to wait
for each request to complete, and then examined r_node to get the
next request after retaking the lock.  But the request completion
removes the request from the tree, so r_node was always undefined
at this point.  Since it's a small race, it usually led to a
valid request, but not always.  The result was an occasional
crash in rb_next() while dereferencing node->rb_left.

Fix this by clearing the rb_node when removing the request from
the request tree, and not walking off into the weeds when we
are done waiting for a request.  Since the request we waited on
will _always_ be out of the request tree, take a ref on the next
request, in the hopes that it won't be.  But if it is, it's ok:
we can start over from the beginning (and traverse over older read
requests again).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:56 -07:00
Sage Weil 916623da10 ceph: only release unused caps with mds requests
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release
with MDS requests (e.g. setattr).  We don't carry refs on most caps, so
this code worked most of the time, but for setattr (utimes) we try to
drop Fscr.

This causes cap state to get slightly out of sync with reality, and may
result in subsequent mds revoke messages getting ignored.

Fix by only releasing unused caps.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:55 -07:00
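The "only release unused caps" rule from 916623da10 amounts to a bitmask filter. The sketch below assumes hypothetical cap bit values and a hypothetical caps_to_release() helper. It is not the body of encode_inode_release, only an illustration of masking out caps that are still in use.

```c
/* Hypothetical capability bits; not the kernel's CEPH_CAP_* values. */
enum {
    CAP_FILE_SHARED = 1 << 0,
    CAP_FILE_CACHE  = 1 << 1,
    CAP_FILE_RD     = 1 << 2,
};

/*
 * Of the caps the caller would like to drop, return only those that
 * are actually issued to us and not currently in use.
 */
static int caps_to_release(int issued, int used, int drop)
{
    return drop & issued & ~used;
}
```

For example, asking to drop all three caps while CAP_FILE_CACHE is in use yields only CAP_FILE_SHARED | CAP_FILE_RD, so the client's view of its caps stays consistent with what it is really using.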
Sage Weil 15637c8b12 ceph: clean up handle_cap_grant, handle_caps wrt session mutex
Drop session mutex unconditionally in handle_cap_grant, and do the
check_caps from the handle_cap_grant helper.  This avoids using a magic
return value.

Also avoid using a flag variable in the IMPORT case and call
check_caps at the appropriate point.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:54 -07:00
Sage Weil cdc2ce056a ceph: fix session locking in handle_caps, ceph_check_caps
Passing a session pointer to ceph_check_caps() used to mean it would leave
the session mutex locked.  That wasn't always possible if it wasn't passed
CHECK_CAPS_AUTHONLY.  It could unlock the passed session and lock a
different session mutex, which was clearly wrong, and also emitted a
warning when a racing CPU retook it and we did an unlock from the wrong
context.

This was only a problem when there was more than one MDS.

First, make ceph_check_caps unconditionally drop the session mutex, so that
it is free to lock other sessions as needed.  Then adjust the one caller
that passes in a session (handle_cap_grant) accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:53 -07:00
Sage Weil 4ea0043a29 ceph: drop unnecessary WARN_ON in caps migration
If we don't have the exported cap it's because we already released it. No
need to WARN.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:52 -07:00
Sage Weil 12eadc1900 ceph: fix null pointer deref of r_osd in debug output
This causes an oops when debug output is enabled and we kick
an osd request with no current r_osd (sometime after an osd
failure).  Check the pointer before dereferencing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:51 -07:00
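The fix in 12eadc1900 is a guard before the dereference. In the minimal sketch below, the structs, the format_kick() helper, and the message text are hypothetical. Only the r_osd field name is taken from the commit message.

```c
#include <stdio.h>

/* Hypothetical, trimmed-down types; not the kernel's structs. */
struct osd {
    int id;
};

struct osd_request {
    unsigned long long tid;
    struct osd *r_osd;  /* may be NULL after an osd failure */
};

/* Format a debug line, printing -1 when the request has no current osd. */
static int format_kick(char *buf, size_t n, const struct osd_request *req)
{
    return snprintf(buf, n, "kicking tid %llu osd%d", req->tid,
                    req->r_osd ? req->r_osd->id : -1);
}
```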
Sage Weil 0a990e7093 ceph: clean up service ticket decoding
Previously we would decode state directly into our current ticket_handler.
This is problematic if for some reason we fail to decode, because we end
up with half new state and half old state.

We are probably already in bad shape if we get an update we can't decode,
but we may as well be tidy anyway.  Decode into new_* temporaries and
update the ticket_handler only on success.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:47 -07:00
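The decode-into-temporaries pattern from 0a990e7093 can be sketched in userspace C as below. The ticket_state struct, its two fields, the wire layout, and both helpers are assumptions made for illustration. The real ticket_handler carries more state and uses a different encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a ticket handler. */
struct ticket_state {
    uint32_t secret_id;
    uint32_t validity;  /* seconds */
};

/* Read a little-endian u32 and advance *p; -1 if the buffer is short. */
static int decode_u32(const uint8_t **p, const uint8_t *end, uint32_t *out)
{
    if ((size_t)(end - *p) < 4)
        return -1;
    *out = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
           (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
    *p += 4;
    return 0;
}

/* Decode into new_* temporaries; update *th only once everything decoded. */
static int update_ticket(struct ticket_state *th, const uint8_t *buf, size_t len)
{
    const uint8_t *p = buf, *end = buf + len;
    uint32_t new_secret_id, new_validity;

    assert(th && buf);

    if (decode_u32(&p, end, &new_secret_id) < 0)
        return -1;
    if (decode_u32(&p, end, &new_validity) < 0)
        return -1;  /* *th still holds the complete old state */

    th->secret_id = new_secret_id;
    th->validity = new_validity;
    return 0;
}
```

A truncated buffer leaves the handler exactly as it was, rather than half updated.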
Sage Weil 5b3dbb44ab ceph: release old ticket_blob buffer
Release the old ticket_blob buffer when we get an updated service ticket
from the monitor.  Previously these were getting leaked.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:11 -07:00
Sage Weil 807c86e2ce ceph: fix authenticator buffer size calculation
The buffer size was incorrectly calculated for the ceph_x_encrypt()
encapsulated ticket blob.  Use a helper (with correct arithmetic) and
BUG out if we were wrong.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:10 -07:00
Sage Weil 63733a0fc5 ceph: fix authenticator timeout
We were failing to reconnect to services due to an old authenticator, even
though we had the new ticket.  We weren't properly retrying the connect
handshake because we were calling an old/incorrect helper that left
in_base_pos incorrect.  The result was a failure to reconnect to the
OSD or MDS (with an authentication error) if the MDS restarted after the
service had been up a few hours (long enough for the original authenticator
to be invalid).  This was only a problem if AUTH_X authentication was
enabled.

Now that the 'negotiate' and 'connect' stages are fully separated, we
should use the prepare_read_connect() helper instead, and remove the
obsolete one.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:09 -07:00
Sage Weil 8b218b8a4a ceph: fix inode removal from snap realm when racing with migration
When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issued caps were non-zero and
__ceph_is_any_caps(ci) was true.  This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.

Fix this by dropping any residual i_snap_realm ref in destroy_inode.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:08 -07:00