Pull overlayfs updates from Miklos Szeredi:
- Report constant st_ino values across copy-up even if underlying
layers are on different filesystems, but using different st_dev
values for each layer.
Ideally we'd report the same st_dev across the overlay, and it's
possible to do for filesystems that use only 32bits for st_ino by
unifying the inum space. It would be nice if it wasn't a choice of 32
or 64, rather filesystems could report their current maximum (that
could change on resize, so it wouldn't be set in stone).
- miscellaneus fixes and a cleanup of ovl_fill_super(), that was long
overdue.
- created a path_put_init() helper that clears out the pointers after
putting the ref.
I think this could be useful elsewhere, so added it to <linux/path.h>
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (30 commits)
ovl: remove unneeded arg from ovl_verify_origin()
ovl: Put upperdentry if ovl_check_origin() fails
ovl: rename ufs to ofs
ovl: clean up getting lower layers
ovl: clean up workdir creation
ovl: clean up getting upper layer
ovl: move ovl_get_workdir() and ovl_get_lower_layers()
ovl: reduce the number of arguments for ovl_workdir_create()
ovl: change order of setup in ovl_fill_super()
ovl: factor out ovl_free_fs() helper
ovl: grab reference to workbasedir early
ovl: split out ovl_get_indexdir() from ovl_fill_super()
ovl: split out ovl_get_lower_layers() from ovl_fill_super()
ovl: split out ovl_get_workdir() from ovl_fill_super()
ovl: split out ovl_get_upper() from ovl_fill_super()
ovl: split out ovl_get_lowerstack() from ovl_fill_super()
ovl: split out ovl_get_workpath() from ovl_fill_super()
ovl: split out ovl_get_upperpath() from ovl_fill_super()
ovl: use path_put_init() in error paths for ovl_fill_super()
vfs: add path_put_init()
...
If ovl_check_origin() fails, we should put upperdentry. We have a reference
on it by now. So goto out_put_upper instead of out.
Fixes: a9d019573e ("ovl: lookup non-dir copy-up-origin by file handle")
Cc: <stable@vger.kernel.org> #4.12
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename all "struct ovl_fs" pointers to "ofs". The "ufs" name is historical
and can only be found in overlayfs/super.c.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move calling ovl_get_lower_layers() into ovl_get_lowerstack().
ovl_get_lowerstack() now returns the root dentry's filled in ovl_entry.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move calling ovl_get_workdir() into ovl_get_workpath().
Rename ovl_get_workdir() to ovl_make_workdir() and ovl_get_workpath() to
ovl_get_workdir().
Workpath is now not needed outside ovl_get_workdir().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Merge ovl_get_upper() and ovl_get_upperpath().
The resulting function is named ovl_get_upper(), though it still returns
upperpath as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Remove "sb" and "dentry" arguments of ovl_workdir_create() and related
functions. Move setting MS_RDONLY flag to callers.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move ovl_get_upper() immediately after ovl_get_upperpath(),
ovl_get_workdir() immediately after ovl_get_workdir() and
ovl_get_lower_layers() immediately after ovl_get_lowerstack().
Also move prepare_creds() up to where other allocations are happening.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_rename() updates dir cache version for impure old parent if an entry
with copy up origin is moved into old parent, but it did not update
cache version if the entry moved out of old parent has a copy up origin.
[SzM] Same for new dir: we updated the version if an entry with origin was
moved in, but not if an entry with origin was moved out.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For the case of all layers not on the same fs, return the copy up origin
inode st_dev/st_ino for non-dir from stat(2).
This guaranties constant st_dev/st_ino for non-dir across copy up.
Like the same fs case, st_ino of non-dir is also persistent.
If the st_dev/st_ino for copied up object would have been the same as
that of the real underlying lower file, running diff on underlying lower
file and overlay copied up file would result in diff reporting that the
two files are equal when in fact, they may have different content.
Therefore, unlike the same fs case, st_dev is not persistent because it
uses the unique anonymous bdev allocated for the lower layer.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For non-samefs setup, to make sure that st_dev/st_ino pair is unique
across the system, we return a unique anonymous st_dev for stat(2)
of lower layer inode.
A following patch is going to fix constant st_dev/st_ino across copy up
by returning origin st_dev/st_ino for copied up objects.
If the st_dev/st_ino for copied up object would have been the same as
that of the real underlying lower file, running diff on underlying lower
file and overlay copied up file would result in diff reporting that the
2 files are equal when in fact, they may have different content.
[amir: simplify ovl_get_pseudo_dev()
split from allocate anonymous bdev patch]
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Generate unique values of st_dev per lower layer for non-samefs
overlay mount. The unique values are obtained by allocating anonymous
bdevs for each of the lowerdirs in the overlayfs instance.
The anonymous bdev is going to be returned by stat(2) for lowerdir
non-dir entries in non-samefs case.
[amir: split from ovl_getattr() and re-structure patches]
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Define new structures to represent overlay instance lower layers and
overlay merge dir lower layers to make room for storing more per layer
information in-memory.
Instead of keeping the fs instance lower layers in an array of struct
vfsmount, keep them in an array of new struct ovl_layer, that has a
pointer to struct vfsmount.
Instead of keeping the dentry lower layers in an array of struct path,
keep them in an array of new struct ovl_path, that has a pointer to
struct dentry and to struct ovl_layer.
Add a small helper to find the fs layer id that correspopnds to a lower
struct ovl_path and use it in ovl_lookup().
[amir: split re-structure from anonymous bdev patch]
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Most overlayfs c files already explicitly include ovl_entry.h
to use overlay entry struct definitions and upcoming changes
are going to require even more c files to include this header.
All overlayfs c files include overlayfs.h and overlayfs.h itself
refers to some structs defined in ovl_entry.h, so it seems more
logic to include ovl_entry.h from overlayfs.h than from c files.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
An "origin && non-merge" upper dir may have leftover whiteouts that
were created in past mount. overlayfs does no clear this dir when we
delete it, which may lead to rmdir fail or temp file left in workdir.
Simple reproducer:
mkdir lower upper work merge
mkdir -p lower/dir
touch lower/dir/a
mount -t overlay overlay -olowerdir=lower,upperdir=upper,\
workdir=work merge
rm merge/dir/a
umount merge
rm -rf lower/*
touch lower/dir (*)
mount -t overlay overlay -olowerdir=lower,upperdir=upper,\
workdir=work merge
rm -rf merge/dir
Syslog dump:
overlayfs: cleanup of 'work/#7' failed (-39)
(*): if we do not create the regular file, the result is different:
rm: cannot remove "dir/": Directory not empty
This patch adds a check for the case of non-merge dir that may contain
whiteouts, and calls ovl_check_empty_dir() to check and clear whiteouts
from upper dir when an empty dir is being deleted.
[amir: split patch from ovl_check_empty_dir() cleanup
rename ovl_is_origin() to ovl_may_have_whiteouts()
check OVL_WHITEOUTS flag instead of checking origin xattr]
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Filter out non-whiteout non-upper entries from list of merge dir entries
while checking if merge dir is empty in ovl_check_empty_dir().
The remaining work for ovl_clear_empty() is to clear all entries on the
list.
[amir: split patch from rmdir bug fix]
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a non-merge dir in an overlay mount has an overlay.origin xattr, it
means it was once an upper merge dir, which may contain whiteouts and
then the lower dir was removed under it.
Do not iterate real dir directly in this case to avoid exposing whiteouts.
[SzM] Set OVL_WHITEOUT for all merge directories as well.
[amir] A directory that was just copied up does not have the OVL_WHITEOUTS
flag. We need to set it to fix merge dir iteration.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This fixes a lockdep splat when mounting a nested overlayfs.
Fixes: a015dafcaf ("ovl: use ovl_inode mutex to synchronize...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index=on, ovl_indexdir_cleanup() tries to cleanup invalid index
entries (e.g. bad index name). This behavior could result in cleaning of
entries created by newer kernels and is therefore undesirable.
Instead, abort mount if such entries are encountered. We still cleanup
'stale' entries and 'orphan' entries, both those cases can be a result
of offline changes to lower and upper dirs.
When encoutering an index entry of type directory or whiteout, kernel
was supposed to fallback to read-only mount, but the fill_super()
operation returns EROFS in this case instead of returning success with
read-only mount flag, so mount fails when encoutering directory or
whiteout index entries. Bless this behavior by returning -EINVAL on
directory and whiteout index entries as we do for all unsupported index
entries.
Fixes: 61b674710c ("ovl: do not cleanup directory and whiteout index..")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Treat ENOENT from index entry lookup the same way as treating a returned
negative dentry. Apparently, either could be returned if file is not
found, depending on the underlying file system.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Commit fbaf94ee3c ("ovl: don't set origin on broken lower hardlink")
attempt to avoid the condition of non-indexed upper inode with lower
hardlink as origin. If this condition is found, lookup returns EIO.
The protection of commit mentioned above does not cover the case of lower
that is not a hardlink when it is copied up (with either index=off/on)
and then lower is hardlinked while overlay is offline.
Changes to lower layer while overlayfs is offline should not result in
unexpected behavior, so a permanent EIO error after creating a link in
lower layer should not be considered as correct behavior.
This fix replaces EIO error with success in cases where upper has origin
but no index is found, or index is found that does not match upper
inode. In those cases, lookup will not fail and the returned overlay inode
will be hashed by upper inode instead of by lower origin inode.
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The error code is missing here so it means we return ERR_PTR(0) or NULL.
The other error paths all return an error code so this probably should
as well.
Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This was detected by fault injection test
Signed-off-by: Hirofumi Nakagawa <nklabs@gmail.com>
Fixes: 13cf199d00 ("ovl: allocate an ovl_inode struct")
Cc: <stable@vger.kernel.org> # v4.13
Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.
Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:
Terminal 1:
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
Terminal 2:
unshare -m
Terminal 1:
umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy
To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.
Documentation was updated to reflect this change.
Fixes: 2cac0c00a6 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Euan Kemp <euank@euank.com>
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.
Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
index dentry was not released when breaking out of the loop
due to index verification error.
Fixes: 415543d5c6 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation. As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.
The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory. So
this is rather misleading and hard to evaluate for any benefits.
I have checked some random users and none of them has added the flag
with a specific justification. I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring. This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.
I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL. Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.
I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.
This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
seems to be a heuristic without any measured advantage for most (if not
all) its current users. The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers. So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.
[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull overlayfs updates from Miklos Szeredi:
"This fixes d_ino correctness in readdir, which brings overlayfs on par
with normal filesystems regarding inode number semantics, as long as
all layers are on the same filesystem.
There are also some bug fixes, one in particular (random ioctl's
shouldn't be able to modify lower layers) that touches some vfs code,
but of course no-op for non-overlay fs"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix false positive ESTALE on lookup
ovl: don't allow writing ioctl on lower layer
ovl: fix relatime for directories
vfs: add flags to d_real()
ovl: cleanup d_real for negative
ovl: constant d_ino for non-merge dirs
ovl: constant d_ino across copy up
ovl: fix readdir error value
ovl: check snprintf return
Commit b9ac5c274b ("ovl: hash overlay non-dir inodes by copy up origin")
verifies that the origin lower inode stored in the overlayfs inode matched
the inode of a copy up origin dentry found by lookup.
There is a false positive result in that check when lower fs does not
support file handles and copy up origin cannot be followed by file handle
at lookup time.
The false negative happens when finding an overlay inode in cache on a
copied up overlay dentry lookup. The overlay inode still 'remembers' the
copy up origin inode, but the copy up origin dentry is not available for
verification.
Relax the check in case copy up origin dentry is not available.
Fixes: b9ac5c274b ("ovl: hash overlay non-dir inodes by copy up...")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Need to treat non-regular overlayfs files the same as regular files when
checking for an atime update.
Add a d_real() flag to make it return the upper dentry for all file types.
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d_real() is never called with a negative dentry. So remove the
d_is_negative() check (which would never trigger anyway, since d_is_reg()
returns false for a negative dentry).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
While we could replace the smp_mb__before_spinlock() with the new
smp_mb__after_spinlock(), the normal pattern is to use
smp_store_release() to publish an object that is used for
lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
/ rcu_dereference() patterns.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>