Commit Graph

130 Commits

Author SHA1 Message Date
Luis Henriques fb18a57568 ceph: quota: add initial infrastructure to support cephfs quotas
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client.  Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:51 +02:00
Zhi Zhang e30ee58121 ceph: try to allocate enough memory for reserved caps
ceph_reserve_caps() may not reserve enough caps under high memory
pressure, but it saved the needed caps number that expected to
be reserved. When getting caps, crash would happen due to number
mismatch.

Now we will try to trim more caps when failing to allocate memory
for caps need to be reserved, then try again. If still failing to
allocate memory, return -ENOMEM.

Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-01-29 18:36:12 +01:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Yan, Zheng 717e6f2893 ceph: avoid panic in create_session_open_msg() if utsname() returns NULL
utsname() can return NULL while process is exiting. Kernel releases
file locks during process exits. We send request to mds when releasing
file lock. So it's possible that we open mds session while process is
exiting. utsname() is called in create_session_open_msg().

Link: http://tracker.ceph.com/issues/21275
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[idryomov@gmail.com: drop utsname.h include from mds_client.c]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-19 21:04:52 +02:00
Jeff Layton 92475f05bd ceph: handle epoch barriers in cap messages
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng” <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Yan, Zheng 79162547b7 ceph: make seeky readdir more efficient
Current cephfs client uses string to indicate start position of
readdir. The string is last entry of previous readdir reply.
This approach does not work for seeky readdir because we can
not easily convert the new postion to a string. For seeky readdir,
mds needs to return dentries from the beginning. Client keeps
retrying if the reply does not contain the dentry it wants.

In current version of ceph, mds sorts CDentry in its cache in
hash order. Client also uses dentry hash to compose dir postion.
For seeky readdir, if client passes the hash part of dir postion
to mds. mds can avoid replying useless dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Elena Reshetova 3997c01d26 ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Jeff Layton 3dd69aabce ceph: add a new flag to indicate whether parent is locked
struct ceph_mds_request has an r_locked_dir pointer, which is set to
indicate the parent inode and that its i_rwsem is locked.  In some
critical places, we need to be able to indicate the parent inode to the
request handling code, even when its i_rwsem may not be locked.

Most of the code that operates on r_locked_dir doesn't require that the
i_rwsem be locked. We only really need it to handle manipulation of the
dcache. The rest (filling of the inode, updating dentry leases, etc.)
already has its own locking.

Add a new r_req_flags bit that indicates whether the parent is locked
when doing the request, and rename the pointer to "r_parent". For now,
all the places that set r_parent also set this flag, but that will
change in a later patch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:08 +01:00
Jeff Layton bc2de10dc4 ceph: convert bools in ceph_mds_request to a new r_req_flags field
Currently, we have a bunch of bool flags in struct ceph_mds_request. We
need more flags though, but each bool takes (at least) a byte. Those
add up over time.

Merge all of the existing bools in this struct into a single unsigned
long, and use the set/test/clear_bit macros to manipulate them. These
are atomic operations, but that is required here to prevent
load/modify/store races. The existing flags are protected by different
locks, so we can't rely on them for that purpose.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20 12:16:08 +01:00
Yan, Zheng fcff415c94 ceph: handle CEPH_SESSION_REJECT message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-10-03 16:13:50 +02:00
Yan, Zheng 0e29438789 ceph: unify cap flush and snapcap flush
This patch includes following changes
- Assign flush tid to snapcap flush
- Remove session's s_cap_snaps_flushing list. Add inode to session's
  s_cap_flushing list instead. Inode is removed from the list when
  there is no pending snapcap flush or cap flush.
- make __kick_flushing_caps() re-send both snapcap flushes and cap
  flushes.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:42 +02:00
Yan, Zheng e4500b5e35 ceph: use list instead of rbtree to track cap flushes
We don't have requirement of searching cap flush by TID. In most cases,
we just need to know TID of the oldest cap flush. List is ideal for this
usage.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:42 +02:00
Yan, Zheng 430afbadd6 ceph: mount non-default filesystem by name
To mount non-default filesytem, user currently needs to provide mds
namespace ID. This is inconvenience.

This patch makes user be able to mount filesystem by name. If user
wants to mount non-default filesystem. Client first subscribes to
fsmap.user. Subscribe to mdsmap.<ID> after getting ID of filesystem.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:40 +02:00
Jeff Layton 8aa152c778 ceph: remove ceph_mdsc_lease_release
Nothing calls it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 03:00:38 +02:00
Yan, Zheng 779fe0fb8e ceph: rados pool namespace support
This patch adds codes that decode pool namespace information in
cap message and request reply. Pool namespace is saved in i_layout,
it will be passed to libceph when doing read/write.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:38 +02:00
Yan, Zheng 7627151ea3 libceph: define new ceph_file_layout structure
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:36 +02:00
Yan, Zheng f3c4ebe65e ceph: using hash value to compose dentry offset
If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry hash hash collision)

This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:36 +02:00
Yan, Zheng 8974eebd38 ceph: record 'offset' for each entry of readdir result
This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:35 +02:00
Yan, Zheng 956d39d631 ceph: define 'end/complete' in readdir reply as bit flags
Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:35 +02:00
Yan, Zheng 2a5beea3f1 ceph: define struct for dir entry in readdir reply
This avoids defining multiple arrays for entries in readdir reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26 01:15:34 +02:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Yan, Zheng 5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Yan, Zheng 68cd5b4b76 ceph: make fsync() wait unsafe requests that created/modified inode
If we get a unsafe reply for request that created/modified inode,
add the unsafe request to a list in the newly created/modified
inode. So we can make fsync() wait these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00
Yan, Zheng 48fec5d0a5 ceph: EIO all operations after forced umount
This patch makes try_get_cap_refs() and __do_request() check
if the file system was forced umount, and return -EIO if it was.
This patch also adds a helper function to drops dirty caps and
wakes up blocking operation.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-09-08 23:14:28 +03:00
Yan, Zheng fdd4e15838 ceph: rework dcache readdir
Previously our dcache readdir code relies on that child dentries in
directory dentry's d_subdir list are sorted by dentry's offset in
descending order. When adding dentries to the dcache, if a dentry
already exists, our readdir code moves it to head of directory
dentry's d_subdir list. This design relies on dcache internals.
Al Viro suggests using ncpfs's approach: keeping array of pointers
to dentries in page cache of directory inode. the validity of those
pointers are presented by directory inode's complete and ordered
flags. When a dentry gets pruned, we clear directory inode's complete
flag in the d_prune() callback. Before moving a dentry to other
directory, we clear the ordered flag for both old and new directory.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:32 +03:00
Yan, Zheng 8310b08913 ceph: track pending caps flushing globally
So we know TID of the oldest pending caps flushing. Later patch will
send this information to MDS, so that MDS can trim its completed caps
flush list.

Tracking pending caps flushing globally also simplifies syncfs code.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:31 +03:00
Yan, Zheng 553adfd941 ceph: track pending caps flushing accurately
Previously we do not trace accurate TID for flushing caps. when
MDS failovers, we have no choice but to re-send all flushing caps
with a new TID. This can cause problem because MDS can has already
flushed some caps and has issued the same caps to other client.
The re-sent cap flush has a new TID, which makes MDS unable to
detect if it has already processed the cap flush.

This patch adds code to track pending caps flushing accurately.
When re-sending cap flush is needed, we use its original flush
TID.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:30 +03:00
Ilya Dryomov a319bf56a6 libceph: store timeouts in jiffies, verify user input
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.

There is also a bug in the way mount_timeout=0 is handled.  It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.

Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-06-25 11:49:29 +03:00
Yan, Zheng e8a7b8b12b ceph: exclude setfilelock requests when calculating oldest tid
setfilelock requests can block for a long time, which can prevent
client from advancing its oldest tid.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:29 +03:00
Yan, Zheng 745a8e3bcc ceph: don't pre-allocate space for cap release messages
Previously we pre-allocate cap release messages for each caps. This
wastes lots of memory when there are large amount of caps. This patch
make the code not pre-allocate the cap release messages. Instead,
we add the corresponding ceph_cap struct to a list when releasing a
cap. Later when flush cap releases is needed, we allocate the cap
release messages dynamically.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:29 +03:00
Yan, Zheng affbc19a68 ceph: make sure syncfs flushes all cap snaps
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:29 +03:00
Yan, Zheng 10183a6955 ceph: check OSD caps before read/write
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-06-25 11:49:28 +03:00
Yan, Zheng 86d8f67b26 ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
use an atomic variable to track number of sessions, this can avoid block
operation inside wait loops.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng 03f4fcb028 ceph: handle SESSION_FORCE_RO message
mark session as readonly and wake up all cap waiters.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:37 +03:00
Yan, Zheng 01deead041 ceph: use getattr request to fetch inline data
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data
in getattr reply will be copied to the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng fb01d1f8b0 ceph: parse inline data in MClientReply and MClientCaps
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng 9280be24dc ceph: fix file lock interruption
When a lock operation is interrupted, current code sends a unlock request to
MDS to undo the lock operation. This method does not work as expected because
the unlock request can drop locks that have already been acquired.

The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR
requests to interrupt blocked file lock request. These requests do not drop
locks that have alread been acquired, they only interrupt blocked file lock
request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:49 +03:00
John Spray a687ecaf50 ceph: export ceph_session_state_name function
...so that it can be used from the ceph debugfs
code when dumping session info.

Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14 12:56:50 -07:00
Yan, Zheng 25e6bae356 ceph: use pagelist to present MDS request data
Current code uses page array to present MDS request data. Pages in the
array are allocated/freed by caller of ceph_mdsc_do_request(). If request
is interrupted, the pages can be freed while they are still being used by
the request message.

The fix is use pagelist to present MDS request data. Pagelist is
reference counted.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:56:49 -07:00
Sage Weil b8e69066d8 ceph: include time stamp in every MDS request
We recently modified the client/MDS protocol to include a timestamp in the
client request.  This allows ctime updates to follow the client's clock
in most cases, which avoids subtle problems when clocks are out of sync
and timestamps are updated sometimes by the MDS clock (for most requests)
and sometimes by the client clock (for cap writeback).

Signed-off-by: Sage Weil <sage@inktank.com>
2014-06-06 09:30:00 +08:00
Yan, Zheng 54008399dc ceph: preallocate buffer for readdir reply
Preallocate buffer for readdir reply. Limit number of entries in
readdir reply according to the buffer size.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-04-04 21:08:22 -07:00
Yan, Zheng 5d72d13c42 ceph: add open export target session helper
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-01-21 16:30:30 +08:00
Yan, Zheng 99a9c273b9 ceph: handle race between cap reconnect and cap release
When a cap get released while composing the cap reconnect message.
We should skip queuing the release message if the cap hasn't been
added to the cap reconnect message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:02 -08:00
Linus Torvalds 1cf0209c43 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "A few groups of patches here.  Alex has been hard at work improving
  the RBD code, layout groundwork for understanding the new formats and
  doing layering.  Most of the infrastructure is now in place for the
  final bits that will come with the next window.

  There are a few changes to the data layout.  Jim Schutt's patch fixes
  some non-ideal CRUSH behavior, and a set of patches from me updates
  the client to speak a newer version of the protocol and implement an
  improved hashing strategy across storage nodes (when the server side
  supports it too).

  A pair of patches from Sam Lang fix the atomicity of open+create
  operations.  Several patches from Yan, Zheng fix various mds/client
  issues that turned up during multi-mds torture tests.

  A final set of patches expose file layouts via virtual xattrs, and
  allow the policies to be set on directories via xattrs as well
  (avoiding the awkward ioctl interface and providing a consistent
  interface for both kernel mount and ceph-fuse users)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
  libceph: add support for HASHPSPOOL pool flag
  libceph: update osd request/reply encoding
  libceph: calculate placement based on the internal data types
  ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
  ceph: update "ceph_features.h"
  libceph: decode into cpu-native ceph_pg type
  libceph: rename ceph_pg -> ceph_pg_v1
  rbd: pass length, not op for osd completions
  rbd: move rbd_osd_trivial_callback()
  libceph: use a do..while loop in con_work()
  libceph: use a flag to indicate a fault has occurred
  libceph: separate non-locked fault handling
  libceph: encapsulate connection backoff
  libceph: eliminate sparse warnings
  ceph: eliminate sparse warnings in fs code
  rbd: eliminate sparse warnings
  libceph: define connection flag helpers
  rbd: normalize dout() calls
  rbd: barriers are hard
  rbd: ignore zero-length requests
  ...
2013-02-28 17:43:09 -08:00
Eric W. Biederman ff3d004662 ceph: Convert struct ceph_mds_request to use kuid_t and kgid_t
Hold the uid and gid for a pending ceph mds request using the types
kuid_t and kgid_t.  When a request message is finally created convert
the kuid_t and kgid_t values into uids and gids in the initial user
namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-12 03:19:26 -08:00
Sam Lang 6e8575faa8 ceph: Check for created flag in response from mds
The mds now sends back a created inode if the create request
performed the create.  If the file already existed, no inode is
returned in the reply.  This allows ceph to set the created flag
in atomic_open so that permissions are properly checked in the case
that the file wasn't created by the create call to the mds.

To ensure compability with previous kernels, a feature for sending
back the inode in the create reply was added, so that the mds will
only send back the inode if the client indicates it supports the
feature.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:36 -06:00
Alex Elder 6c4a19158b ceph: define ceph_auth_handshake type
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers."  Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.

Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-05-17 08:18:12 -05:00
Alex Elder d8fb02abdc ceph: create a new session lock to avoid lock inversion
Lockdep was reporting a possible circular lock dependency in
dentry_lease_is_valid().  That function needs to sample the
session's s_cap_gen and and s_cap_ttl fields coherently, but needs
to do so while holding a dentry lock.  The s_cap_lock field was
being used to protect the two fields, but that can't be taken while
holding a lock on a dentry within the session.

In most cases, the s_cap_gen and s_cap_ttl fields only get operated
on separately.  But in three cases they need to be updated together.
Implement a new lock to protect the spots updating both fields
atomically is required.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:19 -08:00
Sage Weil be655596b3 ceph: use i_ceph_lock instead of i_lock
We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction.  That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.

Changing the list lock ordering would be a painful process.

However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-07 10:46:44 -08:00
Sage Weil 41b02e1f9b ceph: explicitly reference rename old_dentry parent dir in request
We carry a pin on the parent directory for the rename source and dest
dentries.  For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:31:14 -07:00
Sage Weil 2f90b852e3 ceph: ignore lease mask
The lease mask is no longer used (and it changed a while back).  Instead,
use a non-zero duration to indicate that there is a lease being issued.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:28:25 -07:00
Sage Weil db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we change cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant).  This is cleaner and more
robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil 4af25fdda6 ceph: drop redundant r_mds field
The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Sage Weil 14303d20f3 ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS.  This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests.  This will result in a forward
and degrade performance.  It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:13 -08:00
Herb Shiu 25933abdd8 ceph: Handle file locks in replies from the MDS.
Previously the kernel client incorrectly assumed everything was a directory.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:27 -08:00
Sage Weil cb4276cca4 ceph: fix uid/gid on resent mds requests
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid.  Put that information in the
request struct on request setup.

This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-08 07:29:05 -08:00
Yehuda Sadeh 3d14c5d2b6 ceph: factor out libceph from Ceph file system
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph.  This
is mostly a matter of moving files around.  However, a few key pieces
of the interface change as well:

 - ceph_client becomes ceph_fs_client and ceph_client, where the latter
   captures the mon and osd clients, and the fs_client gets the mds client
   and file system specific pieces.
 - Mount option parsing and debugfs setup is correspondingly broken into
   two pieces.
 - The mon client gets a generic handler callback for otherwise unknown
   messages (mds map, in this case).
 - The basic supported/required feature bits can be expanded (and are by
   ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:28 -07:00
Sage Weil f3c60c5918 ceph: fix multiple mds session shutdown
The use of a completion when waiting for session shutdown during umount is
inappropriate, given the complexity of the condition.  For multiple MDS's,
this resulted in the umount thread spinning, often preventing the session
close message from being processed in some cases.

Switch to a waitqueue and defined a condition helper.  This cleans things
up nicely.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-22 15:04:43 -07:00
Greg Farnum e55b71f802 ceph: handle ESTALE properly; on receipt send to authority if it wasn't
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil 154f42c2c3 ceph: connect to export targets on cap export
When we get a cap EXPORT message, make sure we are connected to all export
targets to ensure we can handle the matching IMPORT.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Yehuda Sadeh 37151668ba ceph: do caps accounting per mds_client
Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil ee6b272b9c ceph: drop unused argument
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil e979cf5039 ceph: do not include cap/dentry releases in replayed messages
Strip the cap and dentry releases from replayed messages.  They can
cause the shared state to get out of sync because they were generated
(with the request message) earlier, and no longer reflect the current
client state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:18 -07:00
Sage Weil 2b2300d62e ceph: try to send partial cap release on cap message on missing inode
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately.  This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.

If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:25 -07:00
Sage Weil 3d7ded4d81 ceph: release cap on import if we don't have the inode
If we get an IMPORT that give us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-10 13:30:07 -07:00
Sage Weil 167c9e352d ceph: use common helper for aborted dir request invalidation
We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay.  Use common helper to avoid duplicate code.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 15:25:40 -07:00
Sage Weil b4556396fa ceph: fix race between aborted requests and fill_trace
When we abort requests we need to prevent fill_trace et al from doing
anything that relies on locks held by the VFS caller.  This fixes a race
between the reply handler and the abort code, ensuring that continue
holding the dir mutex until the reply handler completes.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:45 -07:00
Sage Weil e1518c7c0a ceph: clean up mds reply, error handling
We would occasionally BUG out in the reply handler because r_reply was
nonzero, due to a race with ceph_mdsc_do_request temporarily setting
r_reply to an ERR_PTR value.  This is unnecessary, messy, and also wrong
in the EIO case.

Clean up by consistently using r_err for errors and r_reply for messages.
Also fix the abort logic to trigger consistently for all errors that return
to the caller early (e.g., EIO from timeout case).  If an abort races with
a reply, use the result from the reply.

Also fix locking for r_err, r_reply update in the reply handler.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-17 10:25:44 -07:00
Sage Weil 7c1332b8cb ceph: fix iterate_caps removal race
We need to be able to iterate over all caps on a session with a
possibly slow callback on each cap.  To allow this, we used to
prevent cap reordering while we were iterating.  However, we were
not safe from races with removal: removing the 'next' cap would
make the next pointer from list_for_each_entry_safe be invalid,
and cause a lock up or similar badness.

Instead, we keep an iterator pointer in the session pointing to
the current cap.  As before, we avoid reordering.  For removal,
if the cap isn't the current cap we are iterating over, we are
fine.  If it is, we clear cap->ci (to mark the cap as pending
removal) but leave it in the session list.  In iterate_caps, we
can safely finish removal and get the next cap pointer.

While we're at it, clean up put_cap to not take a cap reservation
context, as it was never used.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-17 10:02:47 -08:00
Sage Weil a105f00cf1 ceph: use rbtree for snap_realms
Switch from radix tree to rbtree for snap realms.  This is much more
appropriate given that realm keys are few and far between.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:09 -08:00
Sage Weil 44ca18f268 ceph: use rbtree for mds requests
The rbtree is a more appropriate data structure than a radix_tree.  It
avoids extra memory usage and simplifies the code.

It also fixes a bug where the debugfs 'mdsc' file wasn't including the
most recent mds request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 22:01:08 -08:00
Sage Weil 5b1daecd59 ceph: properly handle aborted mds requests
Previously, if the MDS request was interrupted, we would unregister the
request and ignore any reply.  This could cause the caps or other cache
state to become out of sync.  (For instance, aborting dbench and doing
rm -r on clients would complain about a non-empty directory because the
client didn't realize it's aborted file create request completed.)

Even we don't unregister, we still can't process the reply normally because
we are no longer holding the caller's locks (like the dir i_mutex).

So, mark aborted operations with r_aborted, and in the reply handler, be
sure to process all the caps.  Do not process the namespace changes,
though, since we no longer will hold the dir i_mutex.  The dentry lease
state can also be ignored as it's more forgiving.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-01-25 11:49:51 -08:00
Sage Weil 5dacf09121 ceph: do not touch_caps while iterating over caps list
Avoid confusing iterate_session_caps(), flag the session while we are
iterating so that __touch_cap does not rearrange items on the list.

All other modifiers of session->s_caps do so under the protection of
s_mutex.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-23 08:17:14 -08:00
Sage Weil 153c8e6bf7 ceph: use kref for struct ceph_mds_request
Signed-off-by: Sage Weil <sage@newdream.net>
2009-12-07 12:31:09 -08:00
Sage Weil 4e7a5dcd1b ceph: negotiate authentication protocol; implement AUTH_NONE protocol
When we open a monitor session, we send an initial AUTH message listing
the auth protocols we support, our entity name, and (possibly) a previously
assigned global_id.  The monitor chooses a protocol and responds with an
initial message.

Initially implement AUTH_NONE, a dummy protocol that provides no security,
but works within the new framework.  It generates 'authorizers' that are
used when connecting to (mds, osd) services that simply state our entity
name and global_id.

This is a wire protocol change.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 16:19:57 -08:00
Sage Weil 5f44f14260 ceph: handle errors during osd client init
Unwind initializing if we get ENOMEM during client initialization.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 15:02:36 -08:00
Sage Weil 039934b895 ceph: build cleanly without CONFIG_DEBUG_FS
Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-12 15:56:51 -08:00
Sage Weil cdac830313 ceph: remove recon_gen logic
We don't get an explicit affirmative confirmation that our caps reconnect,
nor do we necessarily want to pay that cost.  So, take all this code out
for now.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-10 16:03:53 -08:00
Sage Weil 685f9a5d14 ceph: do not confuse stale and dead (unreconnected) caps
We were using the cap_gen to track both stale caps (caps that timed out
due to temporarily losing touch with the mds) and dead caps that did not
reconnect after an MDS failure.  Introduce a recon_gen counter to track
reconnections to restarted MDSs and kill dead caps based on that instead.

Rename gen to cap_gen while we're at it to make it more clear which is
which.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-09 12:06:07 -08:00
Sage Weil 2f2dc05340 ceph: MDS client
The MDS (metadata server) client is responsible for submitting
requests to the MDS cluster and parsing the response.  We decide which
MDS to submit each request to based on cached information about the
current partition of the directory hierarchy across the cluster.  A
stateful session is opened with each MDS before we submit requests to
it, and a mutex is used to control the ordering of messages within
each session.

An MDS request may generate two responses.  The first indicates the
operation was a success and returns any result.  A second reply is
sent when the operation commits to disk.  Note that locking on the MDS
ensures that the results of updates are visible only to the updating
client before the operation commits.  Requests are linked to the
containing directory so that an fsync will wait for them to commit.

If an MDS fails and/or recovers, we resubmit requests as needed.  We
also reconnect existing capabilities to a recovering MDS to
reestablish that shared session state.  Old dentry leases are
invalidated.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:09 -07:00