Commit Graph

1684 Commits

Author SHA1 Message Date
Dongsheng Yang 02b2f549d5 libceph: allow setting abort_on_full for rbd
Introduce a new option abort_on_full, default to false. Then
we can get -ENOSPC when the pool is full, or reaches quota.

[ Don't show abort_on_full in /proc/mounts. ]

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-07 22:47:48 +01:00
Linus Torvalds c7eaf342ec A fairly quiet round: a couple of messenger performance improvements
from myself and a few cap handling fixes from Zheng.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlwuI7ATHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizcvB/9GqpAzR+Yy1iIQGNeijPSeuXsrlcQF
 WErfaG8tUwZY3vqv3+OSZBwuMgq6wAyCo3wJmh0GCZoy02WLJbPB/G8AiHtoZUAh
 wAWfL8feZkzx3L7JV0OrPG0GGYkhKu5PebM4rq3cXvlL0OiTKPs8bmbTvh0mSv3z
 gH1odW0j2mAb1/3tqm9M5+7XhrGSnmSfA028NeKx6I4nE0ONd9BEcHZDoRBBQeNf
 tgyxH4IJuuQ+x4/FKIn6+hBbMYiVrTBlz4wQHrJvvzDUeCkWu+E8JZ4utxxNdfmS
 uGsPDRqi4LSMwt1q0HLHhkCP0lg5yf9NByGoy+VH5/gS8ma6be9+IbfX
 =puaN
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A fairly quiet round: a couple of messenger performance improvements
  from myself and a few cap handling fixes from Zheng"

* tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-client:
  ceph: don't encode inode pathes into reconnect message
  ceph: update wanted caps after resuming stale session
  ceph: skip updating 'wanted' caps if caps are already issued
  ceph: don't request excl caps when mount is readonly
  ceph: don't update importing cap's mseq when handing cap export
  libceph: switch more to bool in ceph_tcp_sendmsg()
  libceph: use MSG_SENDPAGE_NOTLAST with ceph_tcp_sendpage()
  libceph: use sock_no_sendpage() as a fallback in ceph_tcp_sendpage()
  libceph: drop last_piece logic from write_partial_message_data()
  ceph: remove redundant assignment
  ceph: cleanup splice_dentry()
2019-01-05 13:58:08 -08:00
Nikolay Borisov f86196ea87 fs: don't open code lru_to_page()
Multiple filesystems open code lru_to_page().  Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.

No functional changes.

Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com>		[ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Arun KS ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Yan, Zheng 5ccedf1ccd ceph: don't encode inode pathes into reconnect message
mds hasn't used inode pathes since introducing inode backtrace.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 16:08:36 +01:00
Yan, Zheng d2f8bb27c8 ceph: update wanted caps after resuming stale session
mds contains an optimization, it does not re-issue stale caps if
client does not want any cap.

A special case of the optimization is that client wants some caps,
but skipped updating 'wanted'. For this case, client needs to update
'wanted' when stale session get renewed.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 16:08:36 +01:00
Yan, Zheng fdac94fab7 ceph: skip updating 'wanted' caps if caps are already issued
When reading cached inode that already has Fscr caps, this can avoid
two cap messages (one updats 'wanted' caps, one clears 'wanted' caps).

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 16:08:36 +01:00
Yan, Zheng 8a2ac3a8e9 ceph: don't request excl caps when mount is readonly
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 16:08:36 +01:00
Yan, Zheng 3c1392d4c4 ceph: don't update importing cap's mseq when handing cap export
Updating mseq makes client think importer mds has accepted all prior
cap messages and importer mds knows what caps client wants. Actually
some cap messages may have been dropped because of mseq mismatch.

If mseq is left untouched, importing cap's mds_wanted later will get
reset by cap import message.

Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 16:08:25 +01:00
Chengguang Xu 0cab9f33d9 ceph: remove redundant assignment
There is redundant assighment of variable i in
ceph_mdsmap_get_random_mds(), just remvoe it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 15:56:04 +01:00
Yan, Zheng 2bf996ac48 ceph: cleanup splice_dentry()
splice_dentry() may drop the original dentry and return other
dentry. It relies on its caller to update pointer that points
to the dropped dentry. This is error-prone.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-26 15:56:03 +01:00
Luis Henriques 6f9718fe41 ceph: make 'nocopyfrom' a default mount option
Since we found a problem with the 'copy-from' operation after objects have
been truncated, offloading object copies to OSDs should be discouraged
until the issue is fixed.

Thus, this patch adds the 'nocopyfrom' mount option to the default mount
options which effectily means that remote copies won't be done in
copy_file_range unless they are explicitly enabled at mount time.

[ Adjust ceph_show_options() accordingly. ]

Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-11 18:22:17 +01:00
Ilya Dryomov 23c625ce30 libceph: assume argonaut on the server side
No one is running pre-argonaut.  In addition one of the argonaut
features (NOSRCADDR) has been required since day one (and a half,
2.6.34 vs 2.6.35) of the kernel client.

Allow for the possibility of reusing these feature bits later.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-11-08 17:51:11 +01:00
Luis Henriques 71f2cc64d0 ceph: quota: fix null pointer dereference in quota check
This patch fixes a possible null pointer dereference in
check_quota_exceeded, detected by the static checker smatch, with the
following warning:

   fs/ceph/quota.c:240 check_quota_exceeded()
    error: we previously assumed 'realm' could be null (see line 188)

Fixes: b7a2921765 ("ceph: quota: support for ceph.quota.max_files")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-11-08 17:51:11 +01:00
Luis Henriques c2c6d3ce0d ceph: add destination file data sync before doing any remote copy
If we try to copy into a file that was just written, any data that is
remote copied will be overwritten by our buffered writes once they are
flushed.  When this happens, the call to invalidate_inode_pages2_range
will also return a -EBUSY error.

This patch fixes this by also sync'ing the destination file before
starting any copy.

Fixes: 503f82a993 ("ceph: support copy_file_range file operation")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-11-08 17:50:37 +01:00
Linus Torvalds 9931a07d51 Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "AFS series, with some iov_iter bits included"

* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  missing bits of "iov_iter: Separate type from direction and use accessor functions"
  afs: Probe multiple fileservers simultaneously
  afs: Fix callback handling
  afs: Eliminate the address pointer from the address list cursor
  afs: Allow dumping of server cursor on operation failure
  afs: Implement YFS support in the fs client
  afs: Expand data structure fields to support YFS
  afs: Get the target vnode in afs_rmdir() and get a callback on it
  afs: Calc callback expiry in op reply delivery
  afs: Fix FS.FetchStatus delivery from updating wrong vnode
  afs: Implement the YFS cache manager service
  afs: Remove callback details from afs_callback_break struct
  afs: Commit the status on a new file/dir/symlink
  afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
  afs: Don't invoke the server to read data beyond EOF
  afs: Add a couple of tracepoints to log I/O errors
  afs: Handle EIO from delivery function
  afs: Fix TTL on VL server and address lists
  afs: Implement VL server rotation
  afs: Improve FS server rotation error handling
  ...
2018-11-01 19:58:52 -07:00
David Howells aa563d7bca iov_iter: Separate type from direction and use accessor functions
In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements.  This makes it easier to add further
iterator types.  Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself.  Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells 00e2370744 iov_iter: Use accessor function
Use accessor functions to access an iterator's type and direction.  This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:40:44 +01:00
Luis Henriques ea4cdc548e ceph: new mount option to disable usage of copy-from op
Add a new mount option 'nocopyfrom' that will prevent the usage of the
RADOS 'copy-from' operation in cephfs.  This could be useful, for example,
for an administrator to temporarily mitigate any possible bugs in the
'copy-from' implementation.

Currently, only copy_file_range uses this RADOS operation.  Setting this
mount option will result in this syscall reverting to the default VFS
implementation, i.e. to perform the copies locally instead of doing remote
object copies.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques 503f82a993 ceph: support copy_file_range file operation
This commit implements support for the copy_file_range syscall in cephfs.
It is implemented using the RADOS 'copy-from' operation, which allows to
do a remote object copy, without the need to download/upload data from/to
the OSDs.

Some manual copy may however be required if the source/destination file
offsets aren't object aligned or if the copy length is smaller than the
object size.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques 2ee9dd958d ceph: add non-blocking parameter to ceph_try_get_caps()
ceph_try_get_caps currently calls try_get_cap_refs with the nonblock
parameter always set to 'true'.  This change adds a new parameter that
allows to set it's value.  This will be useful for a follow-up patch that
will need to get two sets of capabilities for two different inodes without
risking a deadlock.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:23 +02:00
Ilya Dryomov 0d9c1ab3be libceph: preallocate message data items
Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request().  send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b7e ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased.  Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov 26f887e0a3 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
The current requirement is that ceph_osdc_alloc_messages() should be
called after oid and oloc are known.  In preparation for preallocating
message data items, move ceph_osdc_alloc_messages() further down, so
that it is called when OSD op codes are known.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov 61d2f85504 ceph: num_ops is off by one in ceph_aio_retry_work()
Two OSD op slots are allocated, but only one is ever used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Xuehan Xu 6680288441 ceph: set timeout conditionally in __cap_delay_requeue
__cap_delay_requeue could be invoked through ceph_check_caps when there
exists caps that needs to be sent and are delayed by "i_hold_caps_min"
or "i_hold_caps_max". If __cap_delay_requeue sets timeout unconditionally,
there could be a chance that some "wanted" caps can not be release for a
long since their timeouts are reset every time they get delayed.

Fixes: http://tracker.ceph.com/issues/36369
Signed-off-by: Xuehan Xu <xuxuehan@360.cn>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov 894868330a libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy().  This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init().  I counted several memory leaks on
various error paths.

Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov 33165d4723 libceph: introduce ceph_pagelist_alloc()
struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount.  Merge allocation and initialization together.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Luis Henriques bddff633ab ceph: only allow punch hole mode in fallocate
Current implementation of cephfs fallocate isn't correct as it doesn't
really reserve the space in the cluster, which means that a subsequent
call to a write may actually fail due to lack of space.  In fact, it is
currently possible to fallocate an amount space that is larger than the
free space in the cluster.  It has behaved this way since the initial
commit ad7a60de88 ("ceph: punch hole support").

Since there's no easy solution to fix this at the moment, this patch
simply removes support for all fallocate operations but
FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE).

Link: https://tracker.ceph.com/issues/36317
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng fce7a9744b ceph: refactor ceph_sync_read()
Avoid allocating memory for the entire user request: striped_read()
does a synchronous OSD request per object, so it doesn't need more than
object size worth of pages at a time.

[ Preserve the comment, changelog. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng 74c9e6bf4c ceph: check if LOOKUPNAME request was aborted when filling trace
d_lookup()/d_alloc() require parent inode locked. Parent inode is
not locked if request is aborted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng c58f450bd6 ceph: fix dentry leak in ceph_readdir_prepopulate
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng efe328230d Revert "ceph: fix dentry leak in splice_dentry()"
This reverts commit 8b8f53af1e.

splice_dentry() is used by three places. For two places, req->r_dentry
is passed to splice_dentry(). In the case of error, req->r_dentry does
not get updated. So splice_dentry() should not drop reference.

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu 5da207993e ceph: check snap first in ceph_set_acl()
Do the snap check first in ceph_set_acl(), so we can avoid
unnecessary operations when the inode has snap.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu 3167893ae6 ceph: reset cap hold timeout only for requeued inode
__cap_delay_requeue() only requeue inode which does not
have CEPH_I_FLUSH flag, so avoid reset cap hold timeout
for that inode.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:19 +02:00
Ilya Dryomov 8aaff15168 ceph: avoid a use-after-free in ceph_destroy_options()
syzbot reported a use-after-free in ceph_destroy_options(), called from
ceph_mount().  The problem was that create_fs_client() consumed the opt
pointer on some errors, but not on all of them.  Make sure it always
consumes both libceph and ceph options.

Reported-by: syzbot+8ab6f1042021b4eed062@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-09-06 16:18:04 +02:00
Linus Torvalds 0a78ac4b9b The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).  Also included
 y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
 Chengguang and Zheng's feature bit infrastructure for the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
 MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
 kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
 TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
 J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
 nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
 =AtgY
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main things are support for cephx v2 authentication protocol and
  basic support for rbd images within namespaces (myself).

  Also included are y2038 conversion patches from Arnd, a pile of
  miscellaneous fixes from Chengguang and Zheng's feature bit
  infrastructure for the filesystem"

* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
  ceph: don't drop message if it contains more data than expected
  ceph: support cephfs' own feature bits
  crush: fix using plain integer as NULL warning
  libceph: remove unnecessary non NULL check for request_key
  ceph: refactor error handling code in ceph_reserve_caps()
  ceph: refactor ceph_unreserve_caps()
  ceph: change to void return type for __do_request()
  ceph: compare fsc->max_file_size and inode->i_size for max file size limit
  ceph: add additional size check in ceph_setattr()
  ceph: add additional offset check in ceph_write_iter()
  ceph: add additional range check in ceph_fallocate()
  ceph: add new field max_file_size in ceph_fs_client
  libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
  libceph: check authorizer reply/challenge length before reading
  libceph: implement CEPHX_V2 calculation mode
  libceph: add authorizer challenge
  libceph: factor out encrypt_authorizer()
  libceph: factor out __ceph_x_decrypt()
  libceph: factor out __prepare_write_connect()
  libceph: store ceph_auth_handshake pointer in ceph_connection
  ...
2018-08-20 18:26:55 -07:00
Linus Torvalds a66b4cd1e7 Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs open-related updates from Al Viro:

 - "do we need fput() or put_filp()" rules are gone - it's always fput()
   now. We keep track of that state where it belongs - in ->f_mode.

 - int *opened mess killed - in finish_open(), in ->atomic_open()
   instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

 - alloc_file() wrappers with saner calling conventions are introduced
   (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
   much simplification.

 - while we are at it, saner calling conventions for path_init() and
   link_path_walk(), simplifying things inside fs/namei.c (both on
   open-related paths and elsewhere).

* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  few more cleanups of link_path_walk() callers
  allow link_path_walk() to take ERR_PTR()
  make path_init() unconditionally paired with terminate_walk()
  document alloc_file() changes
  make alloc_file() static
  do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
  new helper: alloc_file_clone()
  create_pipe_files(): switch the first allocation to alloc_file_pseudo()
  anon_inode_getfile(): switch to alloc_file_pseudo()
  hugetlb_file_setup(): switch to alloc_file_pseudo()
  ocxlflash_getfile(): switch to alloc_file_pseudo()
  cxl_getfile(): switch to alloc_file_pseudo()
  ... and switch shmem_file_setup() to alloc_file_pseudo()
  __shmem_file_setup(): reorder allocations
  new wrapper: alloc_file_pseudo()
  kill FILE_{CREATED,OPENED}
  switch atomic_open() and lookup_open() to returning 0 in all success cases
  document ->atomic_open() changes
  ->atomic_open(): return 0 in all success cases
  get rid of 'opened' in path_openat() and the helpers downstream
  ...
2018-08-13 19:58:36 -07:00
Yan, Zheng 0fcf6c02b2 ceph: don't drop message if it contains more data than expected
Later version mds may encode more data into messages.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Yan, Zheng 342ce1823e ceph: support cephfs' own feature bits
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Chengguang Xu e5bc08d09f ceph: refactor error handling code in ceph_reserve_caps()
Call new helper __ceph_unreserve_caps() to reduce duplicated code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Chengguang Xu 7bf8f736c8 ceph: refactor ceph_unreserve_caps()
The code of ceph_unreserve_caps() and error handling in
ceph_reserve_caps() are duplicated, so introduce a helper
__ceph_unreserve_caps() to reduce duplicated code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu d554849290 ceph: change to void return type for __do_request()
We do not check return code for __do_request() in all callers,
so change to void return type.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu 9da12e3a7d ceph: compare fsc->max_file_size and inode->i_size for max file size limit
In ceph_llseek(), we compare fsc->max_file_size and inode->i_size to
choose max file size limit.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu 36a4c72d1c ceph: add additional size check in ceph_setattr()
ceph_setattr() finally calls vfs function inode_newsize_ok()
to do offset validation and that is based on sb->s_maxbytes.
Because we set sb->s_maxbytes to MAX_LFS_FILESIZE to through
VFS check and do proper offset validation in cephfs level,
we need adding proper offset validation before calling
inode_newsize_ok().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:43 +02:00
Chengguang Xu 8687a3e2c7 ceph: add additional offset check in ceph_write_iter()
If the offset is larger or equal to both real file size and
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu 0671e9968d ceph: add additional range check in ceph_fallocate()
If the range is larger than both real file size and limit of
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu 719784ba70 ceph: add new field max_file_size in ceph_fs_client
In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:27 +02:00
Ilya Dryomov 6daca13d2e libceph: add authorizer challenge
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply).  This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge).  The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Souptick Joarder 24499847e4 ceph: adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:20 +02:00
Arnd Bergmann 0ed1e90a09 ceph: use timespec64 for r_stamp
The ceph_mds_request stamp still uses the deprecated timespec structure,
this converts it over as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:19 +02:00
Arnd Bergmann fac02ddf91 libceph: use timespec64 for r_mtime
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

[ Remove now redundant ts variable in writepage_nounlock(). ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:14 +02:00
Arnd Bergmann 9bbeab41ce ceph: use timespec64 for inode timestamp
Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.

In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Arnd Bergmann 63ecae7e43 ceph: stop using current_kernel_time()
ceph_mdsc_create_request() is one of the last callers of the
deprecated current_kernel_time() as well as timespec_trunc().

This changes it to use the timespec64 based interfaces instead,
though we still need to convert the result until we are ready to
change over req->r_stamp.

The output of the two functions, ktime_get_coarse_real_ts64() and
current_kernel_time() is the same coarse-granular timestamp,
the only difference here is that ktime_get_coarse_real_ts64()
doesn't overflow in 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu 67fcd15140 ceph: add d_drop for some error cases in ceph_symlink()
When file num exceeds quota limit, should call d_drop to drop
dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu 0459871c49 ceph: add d_drop for some error cases in ceph_mknod()
When file num exceeds quota limit or fails from ceph_per_init_acls()
should call d_drop to drop dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu 61ad36d47d ceph: return errors from posix_acl_equiv_mode() correctly
In order to return correct error code should replace variable ret
using err in error case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Yan, Zheng dfeb84d4ad ceph: fix incorrect use of strncpy
GCC8 prints following warning:

 fs/ceph/mds_client.c:3683:2: warning: ‘strncpy’ output may be truncated
 copying 64 bytes from a string of length 64 [-Wstringop-truncation]

[ Change to strscpy() while at it. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov 2f56b6bae7 libceph: amend "bad option arg" error message
Don't mention "mount" -- in the rbd case it is "mapping".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu 93d35c754d ceph: restore ctime as well in the case of restoring old mode
It's better to restore ctime as well in the case of restoring old mode
in ceph_set_acl().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu f017754d69 ceph: add retry logic for error -ERANGE in ceph_get_acl()
When the size of acl extended attribution is larger than pre-allocated
value buffer size, we will hit error '-ERANGE' and it's probabaly caused
by concurrent get/set acl from different clients. In this case, current
logic just sets acl to NULL so that we cannot get proper information but
the operation looks successful.

This patch adds retry logic for error -ERANGE and return -EIO if fail
from the retry. Additionally, print real errno when failing from
__ceph_getxattr().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Al Viro 44907d7900 get rid of 'opened' argument of ->atomic_open() - part 3
now it can be done...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:20 -04:00
Al Viro be12af3ef5 getting rid of 'opened' argument of ->atomic_open() - part 1
'opened' argument of finish_open() is unused.  Kill it.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:19 -04:00
Al Viro 73a09dd943 introduce FMODE_CREATED and switch to it
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now).  IMA is another one (here and everywhere)...

Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:18 -04:00
Yan, Zheng 8b8f53af1e ceph: fix dentry leak in splice_dentry()
In any case, d_splice_alias() does not drop reference of original
dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-26 18:42:44 +02:00
Linus Torvalds 7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Linus Torvalds dc594c39f7 The main piece is a set of libceph changes that revamps how OSD
requests are aborted, improving CephFS ENOSPC handling and making
 "umount -f" actually work (Zheng and myself).  The rest is mostly
 mount option handling cleanups from Chengguang and assorted fixes
 from Zheng, Luis and Dongsheng.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJbIkigAAoJEEp/3jgCEfOL3EUH/1s7Ib3FgFzG/SPPKISxZOGr
 ndZGg0rPT9mPIQ4rp6t0z/cDlMrluPmCK3sWrAPe//sZz9iZiuip+mCL0gUFXFNr
 1kL2xDKkJzGxtP3UlUvr5CC6bnxLdeBXJRBDLk/swtphuqArKndlbN/iLZnCZivT
 uJDk+vZTwNJ3UhQP4QdnOQLV60NYs+q4euTqbZF3+pDiRiONbxRfXC3adFsc8zL9
 zlie3CHPbrQHWMsfNvbfM3rBH1WhTwEssDm+IEFlKl19q9SKP2WPZfmBcE1pmZ58
 AhIMoNGdQha1FXS6N96kaPaqFgeysPnEPoyHDqLxsUMKqsvJlOEZsK1jujza4rE=
 =EfXm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main piece is a set of libceph changes that revamps how OSD
  requests are aborted, improving CephFS ENOSPC handling and making
  "umount -f" actually work (Zheng and myself).

  The rest is mostly mount option handling cleanups from Chengguang and
  assorted fixes from Zheng, Luis and Dongsheng.

* tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
  rbd: flush rbd_dev->watch_dwork after watch is unregistered
  ceph: update description of some mount options
  ceph: show ino32 if the value is different with default
  ceph: strengthen rsize/wsize/readdir_max_bytes validation
  ceph: fix alignment of rasize
  ceph: fix use-after-free in ceph_statfs()
  ceph: prevent i_version from going back
  ceph: fix wrong check for the case of updating link count
  libceph: allocate the locator string with GFP_NOFAIL
  libceph: make abort_on_full a per-osdc setting
  libceph: don't abort reads in ceph_osdc_abort_on_full()
  libceph: avoid a use-after-free during map check
  libceph: don't warn if req->r_abort_on_full is set
  libceph: use for_each_request() in ceph_osdc_abort_on_full()
  libceph: defer __complete_request() to a workqueue
  libceph: move more code into __complete_request()
  libceph: no need to call flush_workqueue() before destruction
  ceph: flush pending works before shutdown super
  ceph: abort osd requests on force umount
  libceph: introduce ceph_osdc_abort_requests()
  ...
2018-06-15 07:24:58 +09:00
Kees Cook 6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Deepa Dinamani 95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Chengguang Xu 3619aa8b74 ceph: show ino32 if the value is different with default
In current ceph_show_options(), there is no item for showing 'ino32',
so add showing mount option 'ino32' if the value is different with
default.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:02 +02:00
Chengguang Xu 8db0c7596f ceph: strengthen rsize/wsize/readdir_max_bytes validation
The check (intval < PAGE_SIZE) will involve type cast, so even when
specifying negative value to rsize/wsize/readdir_max_bytes, it will
pass the validation check successfully.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Chengguang Xu c36ed50de2 ceph: fix alignment of rasize
On currently logic:
when I specify rasize=0~1 then it will be 4096.
when I specify rasize=2~4097 then it will be 8192.

Make it the same as rsize & wsize.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Luis Henriques 73fb0949cf ceph: fix use-after-free in ceph_statfs()
KASAN found an UAF in ceph_statfs.  This was a one-off bug but looking at
the code it looks like the monmap access needs to be protected as it can
be modified while we're accessing it.  Fix this by protecting the access
with the monc->mutex.

  BUG: KASAN: use-after-free in ceph_statfs+0x21d/0x2c0
  Read of size 8 at addr ffff88006844f2e0 by task trinity-c5/304

  CPU: 0 PID: 304 Comm: trinity-c5 Not tainted 4.17.0-rc6+ #172
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
   dump_stack+0xa5/0x11b
   ? show_regs_print_info+0x5/0x5
   ? kmsg_dump_rewind+0x118/0x118
   ? ceph_statfs+0x21d/0x2c0
   print_address_description+0x73/0x2b0
   ? ceph_statfs+0x21d/0x2c0
   kasan_report+0x243/0x360
   ceph_statfs+0x21d/0x2c0
   ? ceph_umount_begin+0x80/0x80
   ? kmem_cache_alloc+0xdf/0x1a0
   statfs_by_dentry+0x79/0xb0
   vfs_statfs+0x28/0x110
   user_statfs+0x8c/0xe0
   ? vfs_statfs+0x110/0x110
   ? __fdget_raw+0x10/0x10
   __se_sys_statfs+0x5d/0xa0
   ? user_statfs+0xe0/0xe0
   ? mutex_unlock+0x1d/0x40
   ? __x64_sys_statfs+0x20/0x30
   do_syscall_64+0xee/0x290
   ? syscall_return_slowpath+0x1c0/0x1c0
   ? page_fault+0x1e/0x30
   ? syscall_return_slowpath+0x13c/0x1c0
   ? prepare_exit_to_usermode+0xdb/0x140
   ? syscall_trace_enter+0x330/0x330
   ? __put_user_4+0x1c/0x30
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Allocated by task 130:
   __kmalloc+0x124/0x210
   ceph_monmap_decode+0x1c1/0x400
   dispatch+0x113/0xd20
   ceph_con_workfn+0xa7e/0x44e0
   process_one_work+0x5f0/0xa30
   worker_thread+0x184/0xa70
   kthread+0x1a0/0x1c0
   ret_from_fork+0x35/0x40

  Freed by task 130:
   kfree+0xb8/0x210
   dispatch+0x15a/0xd20
   ceph_con_workfn+0xa7e/0x44e0
   process_one_work+0x5f0/0xa30
   worker_thread+0x184/0xa70
   kthread+0x1a0/0x1c0
   ret_from_fork+0x35/0x40

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Yan, Zheng aae1a442f8 ceph: prevent i_version from going back
inode info from non-auth can be stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Yan, Zheng fa466743a9 ceph: fix wrong check for the case of updating link count
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Ilya Dryomov c843d13cae libceph: make abort_on_full a per-osdc setting
The intent behind making it a per-request setting was that it would be
set for writes, but not for reads.  As it is, the flag is set for all
fs/ceph requests except for pool perm check stat request (technically
a read).

ceph_osdc_abort_on_full() skips reads since the previous commit and
I don't see a use case for marking individual requests.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-06-04 20:46:00 +02:00
Yan, Zheng a57d9064e4 ceph: flush pending works before shutdown super
Pending works hold inode references, which cause "Busy inodes after
unmount" warning.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:57 +02:00
Yan, Zheng 12b69d5f6f ceph: abort osd requests on force umount
This avoid force umount waiting on page writeback:

  io_schedule+0xd/0x30
  wait_on_page_bit_common+0xc6/0x130
  __filemap_fdatawait_range+0xbd/0x100
  filemap_fdatawait_keep_errors+0x15/0x40
  sync_inodes_sb+0x1cf/0x240
  sync_filesystem+0x52/0x90
  generic_shutdown_super+0x1d/0x110
  ceph_kill_sb+0x28/0x80 [ceph]
  deactivate_locked_super+0x35/0x60
  cleanup_mnt+0x36/0x70
  task_work_run+0x79/0xa0
  exit_to_usermode_loop+0x62/0x70
  do_syscall_64+0xdb/0xf0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  0xffffffffffffffff

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:57 +02:00
Luis Henriques 8c6286f1c6 ceph: fix st_nlink stat for directories
Currently, calling stat on a cephfs directory returns 1 for st_nlink.
This behaviour has recently changed in the fuse client, as some
applications seem to expect this value to be either 0 (if it's
unlinked) or 2 + number of subdirectories.  This behaviour was changed
in the fuse client with commit 67c7e4619188 ("client: use common
interp of st_nlink for dirs").

This patch modifies the kernel client to have a similar behaviour.

Link: https://tracker.ceph.com/issues/23873
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng 597817ddbb ceph: support file lock on directory
Link: http://tracker.ceph.com/issues/24028
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Ilya Dryomov 6dd4940ba5 ceph: show wsize only if non-default
This is how it was before commit 95cca2b44e ("ceph: limit osd write
size") went in.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng 4985d6f9e5 ceph: handle the new nfiles/nsubdirs fields in cap message
Without these new fields, stale st_size is returned in following
case.

1. MDS modifies a directory
2. MDS issues CEPH_CAP_ANY_SHARED to client
3. The client satifies stat(2) by its cached metadata. set st_size
   to "i_files + i_subdirs".

Link: http://tracker.ceph.com/issues/23855
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng a1c6b83581 ceph: define argument structure for handle_cap_grant
The data structure includes the versioned feilds of cap message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng 2af54a72b5 ceph: update i_files/i_subdirs only when Fs cap is issued
In MDS, file/subdir counts of a directory inode are protected by
filelock. In request reply without Fs cap, nfiles/nsubdirs can be
stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Yan, Zheng 49a9f4f671 ceph: always get rstat from auth mds
rstat is not tracked by capability. client can't know if rstat from
non-auth mds is uptodate or not.

Link: http://tracker.ceph.com/issues/23538
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Yan, Zheng 4e9906e798 ceph: use bit flags to define vxattr attributes
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Deepa Dinamani 13442b036a ceph: make inode time prints to be long long
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.

Convert these print formats to use long long types to
avoid warnings and errors on conversion.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: zyan@redhat.com
Cc: ceph-devel@vger.kernel.org
2018-05-25 15:31:12 -07:00
Ilya Dryomov fc218544fb ceph: fix iov_iter issues in ceph_direct_read_write()
dio_get_pagev_size() and dio_get_pages_alloc() introduced in commit
b5b98989dc ("ceph: combine as many iovec as possile into one OSD
request") assume that the passed iov_iter is ITER_IOVEC.  This isn't
the case with splice where it ends up poking into the guts of ITER_BVEC
or ITER_PIPE iterators, causing lockups and crashes easily reproduced
with generic/095.

Rather than trying to figure out gap alignment and stuff pages into
a page vector, add a helper for going from iov_iter to a bio_vec array
and make use of the new CEPH_OSD_DATA_TYPE_BVECS code.

Fixes: b5b98989dc ("ceph: combine as many iovec as possile into one OSD request")
Link: http://tracker.ceph.com/issues/18130
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
2018-05-10 10:15:12 +02:00
Ilya Dryomov 3a15b38fd2 ceph: fix rsize/wsize capping in ceph_direct_read_write()
rsize/wsize cap should be applied before ceph_osdc_new_request() is
called.  Otherwise, if the size is limited by the cap instead of the
stripe unit, ceph_osdc_new_request() would setup an extent op that is
bigger than what dio_get_pages_alloc() would pin and add to the page
vector, triggering asserts in the messenger.

Cc: stable@vger.kernel.org
Fixes: 95cca2b44e ("ceph: limit osd write size")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-05-10 10:15:00 +02:00
Yan, Zheng f191982689 ceph: check if mds create snaprealm when setting quota
If mds does not, return -EOPNOTSUPP.

Link: http://tracker.ceph.com/issues/23491
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-23 17:35:19 +02:00
Yan, Zheng ffdeec7aa4 ceph: always update atime/mtime/ctime for new inode
For new inode, atime/mtime/ctime are uninitialized.  Don't compare
against them.

Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-16 09:38:40 +02:00
Linus Torvalds b284d4d5a6 The big ticket items are:
- support for rbd "fancy" striping (myself).  The striping feature bit
   is now fully implemented, allowing mapping v2 images with non-default
   striping patterns.  This completes support for --image-format 2.
 
 - CephFS quota support (Luis Henriques and Zheng Yan).  This set is
   based on the new SnapRealm code in the upcoming v13.y.z ("Mimic")
   release.  Quota handling will be rejected on older filesystems.
 
 - memory usage improvements in CephFS (Chengguang Xu).  Directory
   specific bits have been split out of ceph_file_info and some effort
   went into improving cap reservation code to avoid OOM crashes.
 
 Also included a bunch of assorted fixes all over the place from
 Chengguang and others.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJazOI/AAoJEEp/3jgCEfOLOu0IAKGFkcCo0UdQDGHHJZHn2rAm
 CSWMMwyYGAhoWI6Gva0jx1A2omZLFSeq/MC8dWLL/MNAKt8i/qo8bTsTrwCHMR2Q
 D0FsvMWIhkWRS1/FcD1uVDhn0a/DFm5Kfy8kzz3v695TDCt+BYWrCqyHTB/wSdRR
 VpO3KdpHQ9h3ojNBRgIniOCNPeQP+QzLXy+P0h0oKbP2Y03mwJlsWG4L6zakkkwT
 e2I+RVdlOMUDJ7rZxiXESBr6BuLI4oOkPe8roQGmZPy1Xe17xa9M5iWVNuM6RUhO
 Z9bS2aLMhbDyeCPqvzgAnsUtFT0PAQjB5NYw2yqisbHs/wrU5kMOOpcLqz/Ls/s=
 =v1I9
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The big ticket items are:

   - support for rbd "fancy" striping (myself).

     The striping feature bit is now fully implemented, allowing mapping
     v2 images with non-default striping patterns. This completes
     support for --image-format 2.

   - CephFS quota support (Luis Henriques and Zheng Yan).

     This set is based on the new SnapRealm code in the upcoming v13.y.z
     ("Mimic") release. Quota handling will be rejected on older
     filesystems.

   - memory usage improvements in CephFS (Chengguang Xu).

     Directory specific bits have been split out of ceph_file_info and
     some effort went into improving cap reservation code to avoid OOM
     crashes.

  Also included a bunch of assorted fixes all over the place from
  Chengguang and others"

* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
  ceph: quota: report root dir quota usage in statfs
  ceph: quota: add counter for snaprealms with quota
  ceph: quota: cache inode pointer in ceph_snap_realm
  ceph: fix root quota realm check
  ceph: don't check quota for snap inode
  ceph: quota: update MDS when max_bytes is approaching
  ceph: quota: support for ceph.quota.max_bytes
  ceph: quota: don't allow cross-quota renames
  ceph: quota: support for ceph.quota.max_files
  ceph: quota: add initial infrastructure to support cephfs quotas
  rbd: remove VLA usage
  rbd: fix spelling mistake: "reregisteration" -> "reregistration"
  ceph: rename function drop_leases() to a more descriptive name
  ceph: fix invalid point dereference for error case in mdsc destroy
  ceph: return proper bool type to caller instead of pointer
  ceph: optimize memory usage
  ceph: optimize mds session register
  libceph, ceph: add __init attribution to init funcitons
  ceph: filter out used flags when printing unused open flags
  ceph: don't wait on writeback when there is no more dirty pages
  ...
2018-04-10 12:25:30 -07:00
Linus Torvalds 62f8e6c5dc fscache development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWsdxrvu3V2unywtrAQJVmQ/9Fv8d/Ecdwv5nxVBmN7uA8lOYcHEbZWmd
 FhFQE8qYLjKMo9Fy4tPkBbu1l6CVnetaTRE5qwixACJAftrdjABKJAazGR3Uxief
 0jMSWScrV1XCeRErPcczHcx52Hefl8f1DQdA3zpoF0ewz7CjyxMxkl67bsYJbNKE
 T4ebCu5IJk+5PPwwMM3REKjQbunSXXnzgCLUI2cc0Yf76CTVpx6p+NpxV+2wq0p7
 vym83F68qACAEzNH+oozN7IwqjkWyYOnTtCLiMsh4iq30jP6ohtLom6RcRp7QUxM
 Z9hxgG3NptypuVBO1jKxaQ6XZGgAasYmppOmJ/SoALv2PKsAbxi372lTR4ikceKq
 H4oNTbs5tVmyvu3qFwtLN+vX+GdfaoSUnUG8vTvnCB3tHHtYj7q5QeFE0HaX4QSq
 oLANkCOZU8TJsT30pxsCNYiqc5HK9kaLjUQId9K+xq7mM/IuhtNtBQ+ZpqAh5IxB
 4bXKYLdeJ1myZrkYTa6gcTqeFax3djCBJ3UvjTnuqRZAaQg079WkG84Kdq1ZjDRp
 IQpKQnPX9JGhjW1zqLK1Ay8h+HFPgWR5BBVOaLwImr1mH+ccG0iNIeDjrOc8h6J5
 e60XM/x2dIYxpXyFYAkldbAI24aRg1FNzfniG4rSAPecf3SwWrxg/qK7uujLbJHM
 fKNA80yifHo=
 =ukqs
 -----END PGP SIGNATURE-----

Merge tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:
 "Three patches that fix some of AFS's usage of fscache:

   (1) Need to invalidate the cache if a foreign data change is detected
       on the server.

   (2) Move the vnode ID uniquifier (equivalent to i_generation) from
       the auxiliary data to the index key to prevent a race between
       file delete and a subsequent file create seeing the same index
       key.

   (3) Need to retire cookies that correspond to files that we think got
       deleted on the server.

  Four patches to fix some things in fscache and cachefiles:

   (4) Fix a couple of checker warnings.

   (5) Correctly indicate to the end-of-operation callback whether an
       operation completed or was cancelled.

   (6) Add a check for multiple cookie relinquishment.

   (7) Fix a path through the asynchronous write that doesn't wake up a
       waiter for a page if the cache decides not to write that page,
       but discards it instead.

  A couple of patches to add tracepoints to fscache and cachefiles:

   (8) Add tracepoints for cookie operators, object state machine
       execution, cachefiles object management and cachefiles VFS
       operations.

   (9) Add tracepoints for fscache operation management and page
       wrangling.

  And then three development patches:

  (10) Attach the index key and auxiliary data to the cookie, pass this
       information through various fscache-netfs API functions and get
       rid of the callbacks to the netfs to get it.

       This means that the cache can get at this information, even if
       the netfs goes away. It also means that the cache can be lazy in
       updating the coherency data.

  (11) Pass the object data size through various fscache-netfs API
       rather than calling back to the netfs for it, and store the value
       in the object.

       This makes it easier to correctly resize the object, as the size
       is updated on writes to the cache, rather than calling back out
       to the netfs.

  (12) Maintain a catalogue of allocated cookies. This makes it possible
       to catch cookie collision up front rather than down in the bowels
       of the cache being run from a service thread from the object
       state machine.

       This will also make it possible in the future to reconnect to a
       cookie that's not gone dead yet because it's waiting for
       finalisation of the storage and also make it possible to bring
       cookies online if the cache is added after the cookie has been
       obtained"

* tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: Maintain a catalogue of allocated cookies
  fscache: Pass object size in rather than calling back for it
  fscache: Attach the index key and aux data to the cookie
  fscache: Add more tracepoints
  fscache: Add tracepoints
  fscache: Fix hanging wait on page discarded by writeback
  fscache: Detect multiple relinquishment of a cookie
  fscache: Pass the correct cancelled indications to fscache_op_complete()
  fscache, cachefiles: Fix checker warnings
  afs: Be more aggressive in retiring cached vnodes
  afs: Use the vnode ID uniquifier in the cache key not the aux data
  afs: Invalidate cache on server data change
2018-04-07 09:08:24 -07:00
Linus Torvalds 9022ca6b11 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06 11:07:08 -07:00
David Howells ee1235a9a0 fscache: Pass object size in rather than calling back for it
Pass the object size in to fscache_acquire_cookie() and
fscache_write_page() rather than the netfs providing a callback by which it
can be received.  This makes it easier to update the size of the object
when a new page is written that extends the object.

The current object size is also passed by fscache to the check_aux
function, obviating the need to store it in the aux data.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
2018-04-06 14:05:14 +01:00
David Howells 402cb8dda9 fscache: Attach the index key and aux data to the cookie
Attach copies of the index key and auxiliary data to the fscache cookie so
that:

 (1) The callbacks to the netfs for this stuff can be eliminated.  This
     can simplify things in the cache as the information is still
     available, even after the cache has relinquished the cookie.

 (2) Simplifies the locking requirements of accessing the information as we
     don't have to worry about the netfs object going away on us.

 (3) The cache can do lazy updating of the coherency information on disk.
     As long as the cache is flushed before reboot/poweroff, there's no
     need to update the coherency info on disk every time it changes.

 (4) Cookies can be hashed or put in a tree as the index key is easily
     available.  This allows:

     (a) Checks for duplicate cookies can be made at the top fscache layer
     	 rather than down in the bowels of the cache backend.

     (b) Caching can be added to a netfs object that has a cookie if the
     	 cache is brought online after the netfs object is allocated.

A certain amount of space is made in the cookie for inline copies of the
data, but if it won't fit there, extra memory will be allocated for it.

The downside of this is that live cache operation requires more memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
2018-04-04 13:41:28 +01:00
Luis Henriques 9122eed528 ceph: quota: report root dir quota usage in statfs
This commit changes statfs default behaviour when reporting usage
statistics.  Instead of using the overall filesystem usage, statfs now
reports the quota for the filesystem root, if ceph.quota.max_bytes has
been set for this inode.  If quota hasn't been set, it falls back to the
old statfs behaviour.

A new mount option is also added ('noquotadf') to disable this behaviour.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:53 +02:00
Luis Henriques d557c48db7 ceph: quota: add counter for snaprealms with quota
By keeping a counter with the number of snaprealms that have quota set
allows to optimize the functions that need to walk throught the realms
hierarchy looking for quotas.  Thus, if this counter is zero it's safe to
assume that there are no realms with quota.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:53 +02:00
Luis Henriques e3161f17d9 ceph: quota: cache inode pointer in ceph_snap_realm
Keep a pointer to the inode in struct ceph_snap_realm.  This allows to
optimize functions that walk the realms hierarchy (e.g. in quotas).

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:53 +02:00
Yan, Zheng 0eb6bbe4d9 ceph: fix root quota realm check
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:52 +02:00
Yan, Zheng 2596366907 ceph: don't check quota for snap inode
snap inode's i_snap_realm is not pointing to ceph_snap_realm.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:52 +02:00