... and also fix up the comment -- format 1 data objects have always
been 12 hex digits long.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Factor OSD request allocation and initialization code out into
__rbd_osd_req_create().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
The allocation doesn't depend on offset and length. Both offset and
length can be changed after obj_request is allocated, too.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Add support for RBD_FEATURE_DATA_POOL feature. rbd_dev->layout.pool_id
now stores the data pool id.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Rather than initializing layout fields with some made up values in
__rbd_dev_create(), move the initialization into rbd_init_layout() and
call it after the header is actually populated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Returning u64 doesn't make sense: max header->obj_order is 25 and
ceph_file_layout::object_size is u32.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
As explained in the previous commit, rbd_obj_request machinery (and
rbd_osd_req_create() in particular) shouldn't be used for working with
metadata objects.
Switch to the recently added ceph_osdc_call(). It assumes single pages
for outbound and inbound buffers, but that's OK - none of the callers
need more than that. These pages need to be allocated (messenger is in
dire need of proper iterator interface!), but we are swapping for
pages[] and pagelist allocations in the existing code.
Kill class_name argument - all rbd methods are under "rbd".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
To spare checking for "this reply fits into a page, but does it fit
into my buffer?" in some callers, osd_req_op_cls_response_data_pages()
needs to know how big it is.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
rbd_obj_request machinery is completely unnecessary here; all that's
being done is fetching a metadata object - no striping, cloning, etc.
More importantly, rbd_osd_req_create() grabs pool id from layout and
that is becoming a data pool id.
Kill offset argument - all metadata objects are small and read in full.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
No reason to delay it until image_id is known. This will be required
by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is
changed to take oloc.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Image format 1 is deprecated and format 2 doesn't have these. Also,
__rbd_dev_create() takes care of zeroing (or otherwise initializing)
format 2 specific fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
... to accommodate potentially very wide EC pools. This increases the
size of a typical rbd ceph_osd_request by ~12% (from 1040 to 1168 bytes),
but I'd rather go future proof here.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
With EC overwrites maturing, the kernel client will be getting exposed
to potentially very wide EC pools. While "min(pi->size, X)" works fine
when the cluster is stable and happy, truncating OSD sets interferes
with resend logic (ceph_is_new_interval(), etc). Abort the mapping if
the pool is too wide, assigning the request to the homeless session.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Much like Arlo Guthrie, I decided that one big pile is better than two
little piles.
Reflects ceph.git commit 95c2df6c7e0b22d2ea9d91db500cf8b9441c73ba.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.
Reflects ceph.git commit cbcd039651c0569551cb90d26ce27e1432671f2a.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This was causing a build failure for openrisc when using musl and
gcc 5.4.0 since the file is not available in the toolchain.
It doesnt seem this is needed and removing it does not cause any build
warnings for me.
Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In commit c3f4688a08 (ceph: don't set req->r_locked_dir in
ceph_d_revalidate), we changed the code to do a GETATTR instead of a
LOOKUP as the parent info isn't strictly necessary to revalidate the
dentry. What we missed there though is that in order to update the lease
on the dentry after revalidating it, we _do_ need parent info.
Change ceph_d_revalidate back to doing a LOOKUP instead of a GETATTR so
that we can get the parent info in order to update the lease from
ceph_fill_trace. Note that we set req->r_parent here, but we cannot set
the CEPH_MDS_R_PARENT_LOCKED flag as we can't guarantee that it is.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We don't really require that the parent be locked in order to update the
lease on a dentry. Lease info is protected by the d_lock. In the event
that the parent is not locked in ceph_fill_trace, and we have both
parent and target info, go ahead and update the dentry lease.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In a later patch, we're going to need to allow ceph_fill_trace to
update the dentry's lease when the parent is not locked. This is
potentially racy though -- by the time we get around to processing the
trace, the parent may have already changed.
Change update_dentry_lease to take a ceph_vino pointer and use that to
ensure that the dentry's parent still matches it before updating the
lease.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This if block updates the dentry lease even in the case where
the MDS didn't grant one.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
struct ceph_mds_request has an r_locked_dir pointer, which is set to
indicate the parent inode and that its i_rwsem is locked. In some
critical places, we need to be able to indicate the parent inode to the
request handling code, even when its i_rwsem may not be locked.
Most of the code that operates on r_locked_dir doesn't require that the
i_rwsem be locked. We only really need it to handle manipulation of the
dcache. The rest (filling of the inode, updating dentry leases, etc.)
already has its own locking.
Add a new r_req_flags bit that indicates whether the parent is locked
when doing the request, and rename the pointer to "r_parent". For now,
all the places that set r_parent also set this flag, but that will
change in a later patch.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently, we have a bunch of bool flags in struct ceph_mds_request. We
need more flags though, but each bool takes (at least) a byte. Those
add up over time.
Merge all of the existing bools in this struct into a single unsigned
long, and use the set/test/clear_bit macros to manipulate them. These
are atomic operations, but that is required here to prevent
load/modify/store races. The existing flags are protected by different
locks, so we can't rely on them for that purpose.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Just get it from r_session since that's what's always passed in.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Keeping around commented out code is just asking for it to bitrot and
makes viewing the code under cscope more confusing. If
we really need this, then we can revert this patch and put it under a
Kconfig option.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__ceph_caps_mds_wanted() ignores caps from stale session. So the
return value of __ceph_caps_mds_wanted() can keep the same across
ceph_renew_caps(). This causes try_get_cap_refs() to keep calling
ceph_renew_caps(). The fix is ignore the session valid check for
the try_get_cap_refs() case. If session is stale, just let the
caps requester sleep.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when flushing inode's auth cap changes, we need to move it into the
new auth cap session's cap_flushing list
Signed-off-by: Yan, Zheng <zyan@redhat.com>
add_to_page_cache_lru() can fails, so the actual pages to read
can be smaller than the initial size of osd request. We need to
update osd request size in that case.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
sparse says:
fs/ceph/ioctl.c💯28: warning: cast to restricted __le64
preferred_osd is a __s64 so we don't need to do any conversion. Also,
just remove the cast in ceph_ioctl_get_layout as it's not needed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently crypto.c gets linux/sched.h indirectly through linux/slab.h
from linux/kasan.h. Include it directly for memalloc_noio_*() inlines.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
I ran into this compile warning, which is the result of BUG_ON(1)
not always leading to the compiler treating the code path as
unreachable:
include/linux/ceph/osdmap.h: In function 'ceph_can_shift_osds':
include/linux/ceph/osdmap.h:62:1: error: control reaches end of non-void function [-Werror=return-type]
Using BUG() here avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
user space may open/close single file frequently. It's not good
to send a clientcaps message to mds for each open/close syscall.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This patch sets the io_pages bdi hint based on the rsize mount option.
Without this patch large buffered reads (request size > max readahead)
are processed sequentially in chunks of the readahead size (i.e. read
requests are sent out up to the readahead size, then the
do_generic_file_read() function waits until the first page is received).
With this patch read requests are sent out at once up to the size
specified in the rsize mount option (default: 64 MB).
Signed-off-by: Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This removes the uses of ACCESS_ONCE in favor of READ_ONCE
Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
If we have a parent inode reference already, then we don't need to
go back up the directory tree to find one.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Accessing d_parent requires some sort of locking or it could vanish
out from under us. Since we take the d_lock anyway, use that to fetch
d_parent and take a reference to it, and then use that reference to
call ceph_encode_inode_release.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In the event that we have a parent inode reference in the request, we
can use that instead of mucking about in the dcache. Pass any parent
inode info we have down to build_dentry_path so it can make use of it.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While we hold a reference to the dentry when build_dentry_path is
called, we could end up racing with a rename that changes d_parent.
Handle that situation correctly, by using the rcu_read_lock to
ensure that the parent dentry and inode stick around long enough
to safely check ceph_snap and ceph_ino.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
__choose_mds exists to pick an MDS to use when issuing a call. Doing
that typically involves picking an inode and using the authoritative
MDS for it. In most cases, that's pretty straightforward, as we are
using an inode to which we hold a reference (usually represented by
r_dentry or r_inode in the request).
In the case of a snapshotted directory however, we need to fetch
the non-snapped parent, which involves walking back up the parents
in the tree. The dentries in the snapshot dir are effectively frozen
but the overall parent is _not_, and could vanish if a concurrent
rename were to occur.
Clean this code up and take special care to ensure the validity of
the entries we're working with. First, try to use the inode in
r_locked_dir if one exists. If not and all we have is r_dentry,
then we have to walk back up the tree. Use the rcu_read_lock for
this so we can ensure that any d_parent we find won't go away, and
take extra care to deal with the possibility that the dentries could
go negative.
Change get_nonsnap_parent to return an inode, and take a reference to
that inode before returning (if any). Change all of the other places
where we set "inode" in __choose_mds to also take a reference, and then
call iput on that inode before exiting the function.
Link: http://tracker.ceph.com/issues/18148
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
What happens is that a write to /dev/sg is given a request with non-zero
->iovec_count combined with zero ->dxfer_len. Or with ->dxferp pointing
to an array full of empty iovecs.
Having write permission to /dev/sg shouldn't be equivalent to the
ability to trigger BUG_ON() while holding spinlocks...
Found by Dmitry Vyukov and syzkaller.
[ The BUG_ON() got changed to a WARN_ON_ONCE(), but this fixes the
underlying issue. - Linus ]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't crash the machine just because of an empty transfer. Use WARN_ON()
combined with returning an error.
Found by Dmitry Vyukov and syzkaller.
[ Changed to "WARN_ON_ONCE()". Al has a patch that should fix the root
cause, but a BUG_ON() is not acceptable in any case, and a WARN_ON()
might still be a cause of excessive log spamming.
NOTE! If this warning ever triggers, we may end up leaking resources,
since this doesn't bother to try to clean the command up. So this
WARN_ON_ONCE() triggering does imply real problems. But BUG_ON() is
much worse.
People really need to stop using BUG_ON() for "this shouldn't ever
happen". It makes pretty much any bug worse. - Linus ]
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ip6_dst_lookup_tail has acquired a dst and fails the IPv4-mapped
check, release the dst before returning an error.
Fixes: ec5e3b0a1d ("ipv6: Inhibit IPv4-mapped src address on the wire.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two more bugfixes that came in during this week:
- one defconfig change to enable a vital driver used on some Qualcomm
based phones. This was already queued for 4.11, but the maintainer
asked to have it in 4.10 after all.
- One regression fix for the reset controller framework, this got
broken by a typo in the 4.10 merge window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAWKi+0mCrR//JCVInAQKQeA//T3CvNEB5ubRhYMV3G3rML0iEkjSlLkeS
QaUEhRAUzjs5X29u2n0Dn/FQsnkE1ZtBVfZfh6Ai8CWkBP7qHR6HRuHCtXlIeT44
NFihuJx15iIPK9PoPMEQcX+vUmAURJ2+ordRGKcHRO35X/3Z/B587ETtJ5cvSp14
b7lrG3sWGq8lFQM2yYjaDyOWL6WaVRAkzarPvGxfsBijSC4AbUwkK2z2HQd8CfyL
0CQ4H/PA/XqbQidL38NQZgV5tgq9CQXdoILDcDrDWQ8GtfoD5kkuOIAYhYrpdrkx
Tjbwkro9RA+3VlkIKh0r8WzOpBn4nFW6Awluui1I2S4/rgzJYorZagM6AsAJtFGh
hT7J5Tiahbe1QtEiWpBSB+lXZDjSEoQDxVgCliJcGjIIAnv9CbAYFvYeNYk6bPAg
LDND8lMZPnXfaV1LIi3Hq5NO2DQjiL0YM5lLcMXU0N9ritVm/GppGnS+HUZf/u2G
l8erWezCshLaWLvPSOOO4gPSJHzOeNfL596EQ+WeD2nFLVp6xRsKZCwHzj6u3hG8
2BiK12q4cFBTiSG2uVBajCNbxgF68l03wF3d3FCxQm0V1I746EuLQIfybB3z680s
YjkUoPf5ApWjTnR912ay5u8o4/aWvPXJvOVoouDAHtDeDYxKfLO+kynd6vs4xvkw
OluDShz/IC8=
=M6/8
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Two more bugfixes that came in during this week:
- a defconfig change to enable a vital driver used on some Qualcomm
based phones. This was already queued for 4.11, but the maintainer
asked to have it in 4.10 after all.
- a regression fix for the reset controller framework, this got
broken by a typo in the 4.10 merge window"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: multi_v7_defconfig: enable Qualcomm RPMCC
reset: fix shared reset triggered_count decrement on error
Pull ARM fixes from Russell King:
"A couple of fixes from Kees concerning problems he spotted with our
user access support"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8658/1: uaccess: fix zeroing of 64-bit get_user()
ARM: 8657/1: uaccess: consistently check object sizes
Pull x86 fix from Thomas Gleixner:
"Make the build clean by working around yet another GCC stupidity"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vm86: Fix unused variable warning if THP is disabled
Pull locking fix from Thomas Gleixner:
"Move the futex init function to core initcall so user mode helper does
not run into an uninitialized futex syscall"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex_init() to core_initcall