- Fill out the build string
- Prevent inode fork extent count overflows
- Refactor the allocator to reduce long tail latency
- Rework incore log locking a little to reduce spinning
- Break up the xfs_iomap_begin functions into smaller more cohesive
parts
- Fix allocation alignment being dropped too early when the allocation
request is for more blocks than an AG is large
- Other small cleanups
- Clean up file buftarg retrieval helpers
- Hoist the resvsp and unresvsp ioctls to the vfs
- Remove the undocumented biosize mount option, since it has never been
mentioned as existing or supported on linux
- Clean up some of the mount option printing and parsing
- Enhance attr leaf verifier to check block structure
- Check dirent and attr names for invalid characters before passing them
to the vfs
- Refactor open-coded bmbt walking
- Fix a few places where we return EIO instead of EFSCORRUPTED after
failing metadata sanity checks
- Fix a synchronization problem between fallocate and aio dio corrupting
the file length
- Clean up various loose ends in the iomap and bmap code
- Convert to the new mount api
- Make sure we always log something when returning EFSCORRUPTED
- Fix some problems where long running scrub loops could trigger soft
lockup warnings and/or fail to exit due to fatal signals pending
- Fix various Coverity complaints
- Remove most of the function pointers from the directory code to reduce
indirection penalties
- Ensure that dquots are attached to the inode when performing unwritten
extent conversion after io
- Deuglify incore projid and crtime types
- Fix another AGI/AGF locking order deadlock when renaming
- Clean up some quota typedefs
- Remove the FSSETDM ioctls which haven't done anything in 20 years
- Fix some memory leaks when mounting the log fails
- Fix an underflow when updating an xattr leaf freemap
- Remove some trivial wrappers
- Report metadata corruption as an error, not a (potentially) fatal
assertion
- Clean up the dir/attr buffer mapping code
- Allow fatal signals to kill scrub during parent pointer checks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3fNjcACgkQ+H93GTRK
tOv/8w//Y0Oa9Paiy8+iPTChs3/PqeKp307Fj5KONG52haMCakEJFT5+/wpkIAJw
uUmKiPolwN1ivviIUmIS14ThTJ7NV1jq0G0h/0tC25i/3hoJrGWdzqYJMlvhlqgE
taHrjCwPTDkhRJ0D5QCrkkHPU7lSdquO5TWxltaqYLhyLIt8SkklD6dN1dHWEPnk
k0j3TL+VqVJDYyEj1bLwJ0QUb2C3J8ygWnlviF/WxsSeJtJpGoeLEaYXhhsUK0Dt
aHg70OM6zzFzrJJAtJeBXpgaFsG/Pqbcw4wUWSxEMWjVSJwCSKLuZ5F+p6NcqoEj
HeLQkaGePoO61YCInk2JKLHIyx7ohqMOt7+Dm0mdbe1pvcKwV9ZcdkqKa8L/Fm6v
bUP6a2hEpsGy7vLnkYxwYACTLPbGX3uLw8MUr6ZpJ+SpfVLktU4ycpr8dCkJkp6a
0qOpEeHsBDy74NkMOUa7Qrju7lJ2GiL70qqBwaPe+ubcUa3U/3WAsSekSzXgUwn8
Fap4r8wn7cUbxymAvO06RlU8YymuulAlyjwdo9gOL/Su/5POldss6dy1YuUtyq19
CD6NtkHqEUMsTc2cI+H65H44aEeckB1j0D2Grm2uMchAh0GcTSFVNF6jony++B8k
s2sL2dEw9/9vr0uc1TSVF5ezxaONuyaCXdYXUkkdyq3iNvfpRCg=
=aACq
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS updates from Darrick Wong:
"For this release, we changed quite a few things.
Highlights:
- Fixed some long tail latency problems in the block allocator
- Removed some long deprecated (and for the past several years no-op)
mount options and ioctls
- Strengthened the extended attribute and directory verifiers
- Audited and fixed all the places where we could return EFSCORRUPTED
without logging anything
- Refactored the old SGI space allocation ioctls to make the
equivalent fallocate calls
- Fixed a race between fallocate and directio
- Fixed an integer overflow when files have more than a few
billion(!) extents
- Fixed a longstanding bug where quota accounting could be incorrect
when performing unwritten extent conversion on a freshly mounted fs
- Fixed various complaints in scrub about soft lockups and
unresponsiveness to signals
- De-vtable'd the directory handling code, which should make it
faster
- Converted to the new mount api, for better or for worse
- Cleaned up some memory leaks
and quite a lot of other smaller fixes and cleanups.
A more detailed summary:
- Fill out the build string
- Prevent inode fork extent count overflows
- Refactor the allocator to reduce long tail latency
- Rework incore log locking a little to reduce spinning
- Break up the xfs_iomap_begin functions into smaller more cohesive
parts
- Fix allocation alignment being dropped too early when the
allocation request is for more blocks than an AG is large
- Other small cleanups
- Clean up file buftarg retrieval helpers
- Hoist the resvsp and unresvsp ioctls to the vfs
- Remove the undocumented biosize mount option, since it has never
been mentioned as existing or supported on linux
- Clean up some of the mount option printing and parsing
- Enhance attr leaf verifier to check block structure
- Check dirent and attr names for invalid characters before passing
them to the vfs
- Refactor open-coded bmbt walking
- Fix a few places where we return EIO instead of EFSCORRUPTED after
failing metadata sanity checks
- Fix a synchronization problem between fallocate and aio dio
corrupting the file length
- Clean up various loose ends in the iomap and bmap code
- Convert to the new mount api
- Make sure we always log something when returning EFSCORRUPTED
- Fix some problems where long running scrub loops could trigger soft
lockup warnings and/or fail to exit due to fatal signals pending
- Fix various Coverity complaints
- Remove most of the function pointers from the directory code to
reduce indirection penalties
- Ensure that dquots are attached to the inode when performing
unwritten extent conversion after io
- Deuglify incore projid and crtime types
- Fix another AGI/AGF locking order deadlock when renaming
- Clean up some quota typedefs
- Remove the FSSETDM ioctls which haven't done anything in 20 years
- Fix some memory leaks when mounting the log fails
- Fix an underflow when updating an xattr leaf freemap
- Remove some trivial wrappers
- Report metadata corruption as an error, not a (potentially) fatal
assertion
- Clean up the dir/attr buffer mapping code
- Allow fatal signals to kill scrub during parent pointer checks"
* tag 'xfs-5.5-merge-16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (198 commits)
xfs: allow parent directory scans to be interrupted with fatal signals
xfs: remove the mappedbno argument to xfs_da_get_buf
xfs: remove the mappedbno argument to xfs_da_read_buf
xfs: split xfs_da3_node_read
xfs: remove the mappedbno argument to xfs_dir3_leafn_read
xfs: remove the mappedbno argument to xfs_dir3_leaf_read
xfs: remove the mappedbno argument to xfs_attr3_leaf_read
xfs: remove the mappedbno argument to xfs_da_reada_buf
xfs: improve the xfs_dabuf_map calling conventions
xfs: refactor xfs_dabuf_map
xfs: simplify mappedbno handling in xfs_da_{get,read}_buf
xfs: report corruption only as a regular error
xfs: Remove kmem_zone_free() wrapper
xfs: Remove kmem_zone_destroy() wrapper
xfs: Remove slab init wrappers
xfs: fix attr leaf header freemap.size underflow
xfs: fix some memory leaks in log recovery
xfs: fix another missing include
xfs: remove XFS_IOC_FSSETDM and XFS_IOC_FSSETDM_BY_HANDLE
xfs: remove duplicated include from xfs_dir2_data.c
...
Merge updates from Andrew Morton:
"Incoming:
- a small number of updates to scripts/, ocfs2 and fs/buffer.c
- most of MM
I still have quite a lot of material (mostly not MM) staged after
linux-next due to -next dependencies. I'll send those across next week
as the preprequisites get merged up"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
mm/page_io.c: annotate refault stalls from swap_readpage
mm/Kconfig: fix trivial help text punctuation
mm/Kconfig: fix indentation
mm/memory_hotplug.c: remove __online_page_set_limits()
mm: fix typos in comments when calling __SetPageUptodate()
mm: fix struct member name in function comments
mm/shmem.c: cast the type of unmap_start to u64
mm: shmem: use proper gfp flags for shmem_writepage()
mm/shmem.c: make array 'values' static const, makes object smaller
userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
userfaultfd: wrap the common dst_vma check into an inlined function
userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
userfaultfd: use vma_pagesize for all huge page size calculation
mm/madvise.c: use PAGE_ALIGN[ED] for range checking
mm/madvise.c: replace with page_size() in madvise_inject_error()
mm/mmap.c: make vma_merge() comment more easy to understand
mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
autonuma: reduce cache footprint when scanning page tables
autonuma: fix watermark checking in migrate_balanced_pgdat()
...
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.
See also commit 5a9d929d6e ("iomap: report collisions between directio
and buffered writes to userspace").
Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.
Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 0be0ee7181.
I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.
While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads. Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service. In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.
There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor. Which doesn't work well for
them.
The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side). It may be that it's just this that causes problems, but we
clearly weren't ready yet.
Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:
/dev/fuse (fuse_dev_operations)
/dev/mcelog (mce_chrdev_ops)
/dev/mei0 (mei_fops)
/dev/net/tun (tun_fops)
/dev/nvme0 (nvme_dev_fops)
/dev/tpm0 (tpm_fops)
/proc/self/ns/mnt (ns_file_operations)
/dev/snd/pcm* (snd_pcm_f_ops[])
and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.
Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.
We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YA5sQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplFxEACM7CwrWsullPX6b3j62NW6VepU5JQdzwVW
S+bmLpb8Z2I4wzEnaVuWAY5hEhGaS9NFtQLdBG0W0YOzH7sweNmL38dZfCE4+oFj
ZwytpXQQhAQUwkgANJCpNfzDymHduPsTz7RYqRr1plmhna1KC/dnhuMwg8lVOBf5
myWjqcCHxxoQn6KFqcX9/Azz29ZrgzV28lOnZdiw9yoTjraBmS/ymx4woaa3pc2v
UNw0Cgx53vHENJzEL9FNSxc0ENZq/bQhpDolnc2AlPGy9+vPg4afMitJb60KTT7r
HpDcLGkYAIKLrfk8DUmFW8lZhWsxTchXvK2+zwQV7nXMcdUgGN/G3HTIdvWEHFv8
oGbPB8cfdA2vNC9QAybwWEum/S0H/GfYsBVplNCUCdFXE7yj1cbKD5dPfCyIvmPz
BjgMae5vH/KoH+vNdZ8NL5oFz2eFC3rLxa/Ss78pcEoBdiiV3WQHPv9MBmn/OQ/v
CeUAM7omyWpbv3lcByNzIOkeeO3m6Ne28EpEMc2pzLnDPu2btvSyetdO488DE+7O
MNfApZULVX91W7jWnhM5GR+1SJTdEXZnoxnFV+J/j4deog5vUR7Dt1VkujpUILfL
7jMl3erF6C53wNrc465z8iLRp1ZM+aTpwatXXRfucNXeomExKK9zF+/+O1ACckUB
jWDCR9NTcw==
=e5Lx
-----END PGP SIGNATURE-----
Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block
Pull disk revalidation updates from Jens Axboe:
"This continues the work that Jan Kara started to thoroughly cleanup
and consolidate how we handle rescans and revalidations"
* tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block:
block: move clearing bd_invalidated into check_disk_size_change
block: remove (__)blkdev_reread_part as an exported API
block: fix bdev_disk_changed for non-partitioned devices
block: move rescan_partitions to fs/block_dev.c
block: merge invalidate_partitions into rescan_partitions
block: refactor rescan_partitions
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).
However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.
The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file. Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.
The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it. With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.
Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()". We've done that for all the obvious and common
situations, we may need a few more. Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):
"And I appreciate if people could help at least somehow with "getting
rid of mixed case entirely" (i.e. always lock f_pos_lock on
!FMODE_STREAM), because this transition starts to diverge from my
particular use-case too far. To me it makes sense to do that
transition as follows:
- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert left users that
truly don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this
will cover pipes and sockets), or maybe convert pipes and sockets
to FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use
no_llseek or noop_llseek, but do not use nonseekable_open or
alloc_file_pseudo. This might find files that have stream semantic
but are opened differently;
- extend stream_open.cocci to analyze file_operations whose
.read/.write do not use ppos at all (independently of how file was
opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
!FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed
cases to FMODE_STREAM, probably extending stream_open.cocci along
the road to catch similar cases
i.e. always take f_pos_lock unless a file is explicitly marked as
being stream, and try to find and cover all files that are streams"
We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.
The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".
The second test was correct, although a bit subtle and worth just
re-iterating here. There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).
The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.
But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations. Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.
Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Large parts of rescan_partitions aren't about partitions, and
moving it to block_dev.c will allow for some further cleanups by
merging it into its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Many drivers have ioctl() handlers that are completely compatible between
32-bit and 64-bit architectures, except for the argument that is passed
down from user space and may have to be passed through compat_ptr()
in order to become a valid 64-bit pointer.
Using ".compat_ptr = compat_ptr_ioctl" in file operations should let
us simplify a lot of those drivers to avoid #ifdef checks, and convert
additional drivers that don't have proper compat handling yet.
On most architectures, the compat_ptr_ioctl() just passes all arguments
to the corresponding ->ioctl handler. The exception is arch/s390, where
compat_ptr() clears the top bit of a 32-bit pointer value, so user space
pointers to the second 2GB alias the first 2GB, as is the case for native
32-bit s390 user space.
The compat_ptr_ioctl() function must therefore be used only with
ioctl functions that either ignore the argument or pass a pointer to a
compatible data type.
If any ioctl command handled by fops->unlocked_ioctl passes a plain
integer instead of a pointer, or any of the passed data types is
incompatible between 32-bit and 64-bit architectures, a proper handler
is required instead of compat_ptr_ioctl.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
v3: add a better description
v2: use compat_ptr_ioctl instead of generic_compat_ioctl_ptrarg,
as suggested by Al Viro
- add a new knfsd file cache, so that we don't have to open and
close on each (NFSv2/v3) READ or WRITE. This can speed up
read and write in some cases. It also replaces our readahead
cache.
- Prevent silent data loss on write errors, by treating write
errors like server reboots for the purposes of write caching,
thus forcing clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving,
so that NFSv4.1 mounts are less likely to hang when a server
already has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
now be limited only by the backend filesystem and the
maximum RPC size.
- Allow the server to enforce use of the correct kerberos
credentials when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl2OoFcVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+dRoP/3OW1NxPjpjbCQWZL0M+O3AYJJla
W8E+uoZKMosFEe/ymokMD0Vn5s47jPaMCifMjHZa2GygW8zHN9X2v0HURx/lob+o
/rJXwMn78N/8kdbfDz2FvaCPeT0IuNzRIFBV8/sSXofqwCBwvPo+cl0QGrd4/xLp
X35qlupx62TRk+kbdRjvv8kpS5SJ7BvR+FSA1WubNYWw2hpdEsr2OCFdGq2Wvthy
DK6AfGBXfJGsOE+HGCSj6ejRV6i0UOJ17P8gRSsx+YT0DOe5E7ROjt+qvvRwk489
wmR8Vjuqr1e40eGAUq3xuLfk5F5NgycY4ekVxk/cTVFNwWcz2DfdjXQUlyPAbrSD
SqIyxN1qdKT24gtr7AHOXUWJzBYPWDgObCVBXUGzyL81RiDdhf38HRNjL2TcSDld
tzCjQ0wbPw+iT74v6qQRY05oS+h3JOtDjU4pxsBnxVtNn4WhGJtaLfWW8o1C1QwU
bc4aX3TlYhDmzU7n7Zjt4rFXGJfyokM+o6tPao1Z60Pmsv1gOk4KQlzLtW/jPHx4
ZwYTwVQUKRDBfC62nmgqDyGI3/Qu11FuIxL2KXUCgkwDxNWN4YkwYjOGw9Lb5qKM
wFpxq6CDNZB/IWLEu8Yg85kDPPUJMoI657lZb7Osr/MfBpU0YljcMOIzLBy8uV1u
9COUbPaQipiWGu/0
=diBo
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...
In previous patch, an application could put part of its text section in
THP via madvise(). These THPs will be protected from writes when the
application is still running (TXTBSY). However, after the application
exits, the file is available for writes.
This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write. A new counter nr_thps is added to struct
address_space. In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.
Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series from Deepa Dinamani adds a per-superblock minimum/maximum
timestamp limit for a file system, and clamps timestamps as they are
written, to avoid random behavior from integer overflow as well as having
different time stamps on disk vs in memory.
At mount time, a warning is now printed for any file system that can
represent current timestamps but not future timestamps more than 30
years into the future, similar to the arbitrary 30 year limit that was
added to settimeofday().
This was picked as a compromise to warn users to migrate to other file
systems (e.g. ext4 instead of ext3) when they need the file system to
survive beyond 2038 (or similar limits in other file systems), but not
get in the way of normal usage.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdcs20AAoJEJpsee/mABjZaOwQALl3lBEhg0aV6a0ZZ1uYehtd
vcjZ6OpehfiOAxYJu0wfLPATo4T0FuBxZKz3+trkJDICcxyc68AJ2wijwInIQnZW
MrSKnPyv/fSGp8Jr5w/0CLdp6yT6Dh7z4j2UxhwusR1bQh4cCYSswDg29/nmxgKp
Nu8m7jMvJQ2Q0r4Zy0sT/MaycUcSH5yvpyTcsYFixGOz1niNy91ISs1+aq6HZ3i3
+cuYTUy13y40iNUHzFBTcJItBnikwZOQ/zjNfJFXZ3bVEUPg8ZTLPYQ0OZz+pM0Z
AlXCKghb2EOKgq729LtA6oaY+Nom/1Gm1p80q3G+nGRVOqRgC+dfAVPZQoiER5Y1
zNPEDf2Sf7J9xktvfC+Qqa9QEUPLKs22ZIccG+vYBW65sS8IAiEDH3LAt444GGls
yB/Cx/Qw7BftpR5Om27Mhm5jDQzr43iTkZaPQWq7ydJXpfxnjlg9L19yS1omDFyV
hdbBXY6FikUICPKUW6I49z5BhjL+kmK9M2DVljImmdKNDTrfr0xY5M/EWjJZ7X+I
rnSe9qTY+iQ5/AXANn5wfj1Y6L5IxkmdWI/zDIbKhYMZLCqqFLd3mJERbs+CMDJq
qNrYyFPReFrg50oSduBPAByMTR4x9hus7iIC7r77kpoz5i60DPmIJoTfFm3844Gv
sBEyvWV08CpE9mSzXuv6
=em9y
-----END PGP SIGNATURE-----
Merge tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 vfs updates from Arnd Bergmann:
"Add inode timestamp clamping.
This series from Deepa Dinamani adds a per-superblock minimum/maximum
timestamp limit for a file system, and clamps timestamps as they are
written, to avoid random behavior from integer overflow as well as
having different time stamps on disk vs in memory.
At mount time, a warning is now printed for any file system that can
represent current timestamps but not future timestamps more than 30
years into the future, similar to the arbitrary 30 year limit that was
added to settimeofday().
This was picked as a compromise to warn users to migrate to other file
systems (e.g. ext4 instead of ext3) when they need the file system to
survive beyond 2038 (or similar limits in other file systems), but not
get in the way of normal usage"
* tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
ext4: Reduce ext4 timestamp warnings
isofs: Initialize filesystem timestamp ranges
pstore: fs superblock limits
fs: omfs: Initialize filesystem timestamp ranges
fs: hpfs: Initialize filesystem timestamp ranges
fs: ceph: Initialize filesystem timestamp ranges
fs: sysv: Initialize filesystem timestamp ranges
fs: affs: Initialize filesystem timestamp ranges
fs: fat: Initialize filesystem timestamp ranges
fs: cifs: Initialize filesystem timestamp ranges
fs: nfs: Initialize filesystem timestamp ranges
ext4: Initialize timestamps limits
9p: Fill min and max timestamps in sb
fs: Fill in max and min timestamps in superblock
utimes: Clamp the timestamps before update
mount: Add mount warning for impending timestamp expiry
timestamp_truncate: Replace users of timespec64_trunc
vfs: Add timestamp_truncate() api
vfs: Add file timestamp range support
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to avoid
problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1yhwsACgkQ+H93GTRK
tOtTig//fYLgFVz3l6ffCb8WkJkmi7iWOJp3eLzK55+3W0++ThNsMlRTOWH7xCpZ
f+3LEvvm1ILBgf4XVlwUGt2HlLmNZeKYmiOl/jZxCH25KdfILRIyeyacAYf9vIWf
NQr5HOutsa1IfEDCiDwEnxuuVbgC+rN8j7Rlp/PpweXwRYjssqRWnGRgaZchLbyr
JZ40D9J1HLooY/yftKrgnxtfL4rmAhPoGdX3DnZmobHYRpFHrY31Ks24w6ogShDu
usczNeShXWlg31B4fVHo/rrVQ0xG77U+w/DTNvrAj0uvAlzvWVVibpaZjZtbhadO
NM0zOG41BY/ExBAHhpg0ieVdYI7wNEftF9gjyT7cXO9soD1mRgH6UKQMCm+o1frF
brtcpgQS2aEyGZaXGBIS23ziT/+LLGcav7LUeo7Rf6yiVoEA+FlsGaymC7l+FGCQ
lcgHdeRkeukdj+GJlmpiedb+Xya2g464CXswW7JtCghdNsypRsI4OdQQ2r8Du+w0
PUwfugv1cMAz99xfSZtSoTa7pimFxb6tHRcoqZVfQCefbKQ0VMJDU/AY7gQ2U3UM
PiFKXgPFo0p4tUvA/9ECTPcMDhMKMv200CGCJKXrokWwHtJ6jrAHb+EobjrfoiyX
+hkGEmzzt3vur7Zt2+YesCH3tZj1UfpsemOlorxYQk3hbsA9HEc=
=TZLp
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"For this cycle we have the usual pile of cleanups and bug fixes, some
performance improvements for online metadata scrubbing, massive
speedups in the directory entry creation code, some performance
improvement in the file ACL lookup code, a fix for a logging stall
during mount, and fixes for concurrency problems.
It has survived a couple of weeks of xfstests runs and merges cleanly.
Summary:
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to
avoid problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount"
* tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
xfs: push the grant head when the log head moves forward
xfs: push iclog state cleaning into xlog_state_clean_log
xfs: factor iclog state processing out of xlog_state_do_callback()
xfs: factor callbacks out of xlog_state_do_callback()
xfs: factor debug code out of xlog_state_do_callback()
xfs: prevent CIL push holdoff in log recovery
xfs: fix missed wakeup on l_flush_wait
xfs: push the AIL in xlog_grant_head_wake
xfs: Use WARN_ON_ONCE for bailout mount-operation
xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
xfs: define a flags field for the AG geometry ioctl structure
xfs: add a xfs_valid_startblock helper
xfs: remove the unused XFS_ALLOC_USERDATA flag
xfs: cleanup xfs_fsb_to_db
xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate
xfs: Fix stale data exposure when readahead races with hole punch
fs: Export generic_fadvise()
mm: Handle MADV_WILLNEED through vfs_fadvise()
xfs: allocate xattr buffer on demand
xfs: consolidate attribute value copying
...
- Prohibit writing to active swap files and swap partitions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1cDHQACgkQ+H93GTRK
tOs1qw//aqXAQ7bpLDl7jx9CSuAighzKir0mHYFm9HUsnuRT6gLqIOVSeugoi8hY
tYhPNzcKHL39YDa1QfKo1RKW6uCwsECHT/5TebLxBkTL3vGGAenPchAcjj89SV54
lQ/h8O6hkDU+KCKC0kmDem7ma7DD5YZmWXDxW/HvnygjCnZ9BFaOeLQt/TPBmOmN
lozPHcdrxhIuCuSTMjIZRq27Zl6uzj5tr+FkT+FWiYDrGhgWT7is6o397SEm7yYT
3JqUQ+ZUOY4IwLlrWiVKqi0IqjvWqhaLzmjZaKF+YC8Ni0sdpaDdsE/uPSCyQ7k7
28qbfypnu7bswakjcekdSX2Dj/SZivFb8AzlqaSIMVlw4STFzjMMYMLib8/OlPES
z1pAjXHypLjNO3dNBYp/mRll+/BQ2NM6oCtnVVQGKVnlcx3oLo+n6JSRK8t74DTf
BkYu93aybBpfE49Fb3VQum+9okg9BdShRxvUp023/WTUaa8aUyIbizn3iTrke/sx
0bC+Vvdr33JZnoO8WKVzSd7COTHOTQ920NodTKAJ9bkF3WKyLM135ctavHrtdAg3
FHBXpN7AjbOaLovLpiy3eb//ghKJwgyhqbN6VCGTudC/nkaXq7y7M3DPbXxQYxom
yCA0qMByMg+cL4BtzS52QEda7xK1iiuQ/3jbdQ4lFuBHhwekVKs=
=Ag/B
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull swap access updates from Darrick Wong:
"Prohibit writing to active swap files and swap partitions.
There's no non-malicious use case for allowing userspace to scribble
on storage that the kernel thinks it owns"
* tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: don't allow writes to swap files
mm: set S_SWAPFILE on blockdev swap devices
Please consider pulling fs-verity for 5.4.
fs-verity is a filesystem feature that provides Merkle tree based
hashing (similar to dm-verity) for individual readonly files, mainly for
the purpose of efficient authenticity verification.
This pull request includes:
(a) The fs/verity/ support layer and documentation.
(b) fs-verity support for ext4 and f2fs.
Compared to the original fs-verity patchset from last year, the UAPI to
enable fs-verity on a file has been greatly simplified. Lots of other
things were cleaned up too.
fs-verity is planned to be used by two different projects on Android;
most of the userspace code is in place already. Another userspace tool
("fsverity-utils"), and xfstests, are also available. e2fsprogs and
f2fs-tools already have fs-verity support. Other people have shown
interest in using fs-verity too.
I've tested this on ext4 and f2fs with xfstests, both the existing tests
and the new fs-verity tests. This has also been in linux-next since
July 30 with no reported issues except a couple minor ones I found
myself and folded in fixes for.
Ted and I will be co-maintaining fs-verity.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXX8ZUBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK2YOAQCbnBAKWDxXS3alLARRwjQLjmEtQIGl
gsek+WurFIg/zAEAlpSzHwu13LvYzTqv3rhO2yhSlvhnDu4GQEJPXPm0wgM=
=ID0n
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fs-verity support from Eric Biggers:
"fs-verity is a filesystem feature that provides Merkle tree based
hashing (similar to dm-verity) for individual readonly files, mainly
for the purpose of efficient authenticity verification.
This pull request includes:
(a) The fs/verity/ support layer and documentation.
(b) fs-verity support for ext4 and f2fs.
Compared to the original fs-verity patchset from last year, the UAPI
to enable fs-verity on a file has been greatly simplified. Lots of
other things were cleaned up too.
fs-verity is planned to be used by two different projects on Android;
most of the userspace code is in place already. Another userspace tool
("fsverity-utils"), and xfstests, are also available. e2fsprogs and
f2fs-tools already have fs-verity support. Other people have shown
interest in using fs-verity too.
I've tested this on ext4 and f2fs with xfstests, both the existing
tests and the new fs-verity tests. This has also been in linux-next
since July 30 with no reported issues except a couple minor ones I
found myself and folded in fixes for.
Ted and I will be co-maintaining fs-verity"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
f2fs: add fs-verity support
ext4: update on-disk format documentation for fs-verity
ext4: add fs-verity read support
ext4: add basic fs-verity support
fs-verity: support builtin file signatures
fs-verity: add SHA-512 support
fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
fs-verity: add data verification hooks for ->readpages()
fs-verity: add the hook for file ->setattr()
fs-verity: add the hook for file ->open()
fs-verity: add inode and superblock fields
fs-verity: add Kconfig and the helper functions for hashing
fs: uapi: define verity bit for FS_IOC_GETFLAGS
fs-verity: add UAPI header
fs-verity: add MAINTAINERS file entry
fs-verity: add a documentation file
Filesystems will need to call this function from their fadvise handlers.
CC: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
timespec_trunc() function is used to truncate a
filesystem timestamp to the right granularity.
But, the function does not clamp tv_sec part of the
timestamps according to the filesystem timestamp limits.
The replacement api: timestamp_truncate() also alters the
signature of the function to accommodate filesystem
timestamp clamping according to flesystem limits.
Note that the tv_nsec part is set to 0 if tv_sec is not within
the range supported for the filesystem.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Add fields to the superblock to track the min and max
timestamps supported by filesystems.
Initially, when a superblock is allocated, initialize
it to the max and min values the fields can hold.
Individual filesystems override these to match their
actual limits.
Pseudo filesystems are assumed to always support the
min and max allowable values for the fields.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
With the new file caching infrastructure in nfsd, we can end up holding
files open for an indefinite period of time, even when they are still
idle. This may prevent the kernel from handing out leases on the file,
which is something we don't want to block.
Fix this by running a SRCU notifier call chain whenever on any
lease attempt. nfsd can then purge the cache for that inode before
returning.
Since SRCU is only conditionally compiled in, we must only define the
new chain if it's enabled, and users of the chain must ensure that
SRCU is enabled.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".
Why we need this
~~~~~~~~~~~~~~~~
The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.
Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions. But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic. This is because it depends on
whether the files are currently present in the inode cache.
Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work. Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons. First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.
Second, almost all users of fscrypt actually do need the keys to be
global. The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process. This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.
On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2]. This is ugly and raises security concerns. Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).
By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.
Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:
- Supporting key removal with the semantics such that the secret is
removed immediately and any unused inodes using the key are evicted;
also, the eviction of any in-use inodes can be retried.
- Calculating a key-dependent cryptographic identifier and returning it
to userspace.
- Allowing keys to be added and removed by non-root users, but only keys
for v2 encryption policies; and to prevent denial-of-service attacks,
users can only remove keys they themselves have added, and a key is
only really removed after all users who added it have removed it.
Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.
However, to reuse code the implementation still uses the keyrings
service internally. Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.
References:
[1] https://github.com/google/fscrypt
[2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
[3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
[4] https://github.com/google/fscrypt/issues/116
[5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
[6] https://github.com/google/fscrypt/issues/128
[7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Commit 33ec3e53e7 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:
for i in {a..z}{a..z}; do
dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
mkfs.ext2 $i.image
mkdir mnt$i
done
echo "Run"
for i in {a..z}{a..z}; do
mount -o loop -t ext2 $i.image mnt$i &
done
Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.
Fixes: 33ec3e53e7 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Analogous to fs/crypto/, add fields to the VFS inode and superblock for
use by the fs/verity/ support layer:
- ->s_vop: points to the fsverity_operations if the filesystem supports
fs-verity, otherwise is NULL.
- ->i_verity_info: points to cached fs-verity information for the inode
after someone opens it, otherwise is NULL.
- S_VERITY: bit in ->i_flags that identifies verity inodes, even when
they haven't been opened yet and thus still have NULL ->i_verity_info.
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
- Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls
(which were the file attribute setters for ext4 and xfs and have now
been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0aJgMACgkQ+H93GTRK
tOuKkg//SJaxcB63uVPZk9hDraYTmyo9OXRRX6X9WwDKPTWwa88CUwS1ny1QF7Mt
zMkgzG2/y2Rs9PQ0ARoPbh1hNb2CXnvA+xnzUEev1MW6UN/nTFMZEOPn2ZQ+DxQE
gg/0U56kKgtjtXzBZVpTgHzSETivdXwHxFW3hiTtyRXg+4ulgDIZLOjN2wRB+Pdb
X8ZmM6MqKOTbhQEXlw13TlCKBzoMjC1w4UU4rkZPjoSjAaUWiPfrk/XU7qgguf9p
v1dbSN2dADQ19jzZ1dmggXnlJsRMZjk/ls5rxJlB5DHDbh6YgnA2TE+tYrtH28eB
uyKfD+RQnMzRVdmH8PsMQRQQFXR2UYyprVP7a6wi6TkB+gytn7sR5uT4sbAhmhcF
TiTYfYNRXzemHCewyOwOsUE/7oCeiJcdbqiPAHHD/jYLZfRjSXDcGzz3+7ZYZ3GO
hRxUhpxHPbkmK4T2OxhzReCbRsLN/0BeEcDdLkNWmi2FTh3V1gYzMGkgI9wsVbsd
pHjoGIHbMPWqktF/obuGq96WVfYBBaWJ6WNzQqKT4dQYAJBW2omxitXQHLpi6cjt
hG5ncxa3cPpWx4t3Lx2hb0TPS7RyYvuoQIcS/Me2RWioxrwWrgnOqdHFfLEwWpfN
jRowdWiGgOIsq8hMt7qycmGCXzbgsbaA/7oRqh8TiwM9taPOM4c=
=uH2E
-----END PGP SIGNATURE-----
Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
"Here's a patch series that sets up common parameter checking functions
for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.
The goal here is to reduce the amount of behaviorial variance between
the filesystems where those ioctls originated (ext2 and XFS,
respectively) and everybody else.
- Standardize parameter checking for the SETFLAGS and FSSETXATTR
ioctls (which were the file attribute setters for ext4 and xfs and
have now been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories"
* tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: only allow FSSETXATTR to set DAX flag on files and dirs
vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
vfs: teach vfs_ioc_fssetxattr_check to check project id info
vfs: create a generic checking function for FS_IOC_FSSETXATTR
vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
- Add a new /proc/fs/nfsd/clients/ directory which exposes some
long-requested information about NFSv4 clients (like open files) and
allows forced revocation of client state.
- Replace the global duplicate reply cache by a cache per network
namespace; previously, a request in one network namespace could
incorrectly match an entry from another, though we haven't seen this
in production. This is the last remaining container bug that I'm
aware of; at this point you should be able to run separate nfsd's in
each network namespace, each with their own set of exports, and
everything should work.
- Cleanup and modify lock code to show the pid of lockd as the owner of
NLM locks. This is the correct version of the bugfix originally
attempted in b8eee0e90f "lockd: Show pid of lockd for remote locks".
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl0mX+YVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+EoYQAIbNV7tqpnWRk19ulxveif9zRLMV
ImW99rNhzjfLoIBBTclncCrU1+b2VHqlVGYvml+rdsl+fUCESj2m9/P+D70WHDsl
tk2NJoXkSe1tW4G3YltRfSNNQIsUsEGRa88/4gAT0vYA2OCFDpYzrMleENISQFTp
QQ+p1ct5tofTZbelx5KqdFnLRnQlUeykJbW68/YKIdtNF+nhq07LlvpVKjy4f3MB
rK93qn9YUtnNKldkrP2tWjiPAnzJFiX9XFRPLo2JCv13G28XhhuNp2PmWqsVoY+/
8YMfXY9C028YbrHG9ebwH197XcY1p6ROBZhRxGczEmiSrAHLap8rNGjyYk6+4eO9
5HAFUQJcFEA1NUD84kpUKNZs9PIi818IgI5FhuJrcCKt8OAeyNJaOo0YU3EhzND2
/iPt+FCBlJwEwXI9WSjZiyW3OFKuvCZZk99iN2s33X0dNqMSrkQVe4AmHm7vYlzF
KD0pthVaOwAA9sHua5MSTpi5LHH/IBdWU49NoCgzK277w8xi05oI6ZkYFJQ9hncV
PIWtmmW1b3uHF95s6Ko7mSU7GLEWB9Ux6B1sfOVNgMETK4i2z0ezUDJ+Hp9RSDcJ
iHrU3kaGZ60uq3HPwunlhOYuSDt5sew5GIpNdheGoLOjuhySK7ZBwFuvupqZKC7H
4nxqlrHVI4B8FOAH
=pAAs
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new /proc/fs/nfsd/clients/ directory which exposes some
long-requested information about NFSv4 clients (like open files)
and allows forced revocation of client state.
- Replace the global duplicate reply cache by a cache per network
namespace; previously, a request in one network namespace could
incorrectly match an entry from another, though we haven't seen
this in production. This is the last remaining container bug that
I'm aware of; at this point you should be able to run separate
nfsd's in each network namespace, each with their own set of
exports, and everything should work.
- Cleanup and modify lock code to show the pid of lockd as the owner
of NLM locks. This is the correct version of the bugfix originally
attempted in b8eee0e90f ("lockd: Show pid of lockd for remote
locks")"
* tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Make __get_nfsdfs_client() static
nfsd: Make two functions static
nfsd: Fix misuse of strlcpy
sunrpc/cache: remove the exporting of cache_seq_next
nfsd: decode implementation id
nfsd: create xdr_netobj_dup helper
nfsd: allow forced expiration of NFSv4 clients
nfsd: create get_nfsdfs_clp helper
nfsd4: show layout stateids
nfsd: show lock and deleg stateids
nfsd4: add file to display list of client's opens
nfsd: add more information to client info file
nfsd: escape high characters in binary data
nfsd: copy client's address including port number to cl_addr
nfsd4: add a client info file
nfsd: make client/ directory names small ints
nfsd: add nfsd/clients directory
nfsd4: use reference count to free client
nfsd: rename cl_refcount
nfsd: persist nfsd filesystem across mounts
...
lookups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl0lFIoACgkQ8vlZVpUN
gaOwNQf/aJxFxHVf4t3lga8kfoMhlbwINQknsGUVwg32HporMa1NxQXjbEMMhs6V
A31gBJ44nYVz1enz7nvbE4kx4quF4E8rDVprEetphv4i8GSdUAihwJwY5/H0oSd8
rxzTZzNKddoyN/j7H4LgAh7bo6IFk54kUuaAWuZDJnJtfLNQ6RBaIwg6u6Z8Fael
9H3u/RtFHqWPQp5j50PMUG06abr26GKi1gLL+yeoFD1tuzC54B5i6uy34amrXlon
5agIQ7YuB9bigK4VaLoF4df7o+7+Oa6ENaQ9O/TQc9Uy9ngdVlPpNb2bVDizRLNn
e369sBFTf3C8sMycJy6x9TCqg2B7Hw==
=EpCF
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many bug fixes and cleanups, and an optimization for case-insensitive
lookups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix coverity warning on error path of filename setup
ext4: replace ktype default_attrs with default_groups
ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
ext4: refactor initialize_dirent_tail()
ext4: rename "dirent_csum" functions to use "dirblock"
ext4: allow directory holes
jbd2: drop declaration of journal_sync_buffer()
ext4: use jbd2_inode dirty range scoping
jbd2: introduce jbd2_inode dirty range scoping
mm: add filemap_fdatawait_range_keep_errors()
ext4: remove redundant assignment to node
ext4: optimize case-insensitive lookups
ext4: make __ext4_get_inode_loc plug
ext4: clean up kerneldoc warnigns when building with W=1
ext4: only set project inherit bit for directory
ext4: enforce the immutable flag on open files
ext4: don't allow any modifications to an immutable file
jbd2: fix typo in comment of journal_submit_inode_data_buffers
jbd2: fix some print format mistakes
ext4: gracefully handle ext4_break_layouts() failure during truncate
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where they
are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle cross-superblock
copy_file_range in their local handlers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK
tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P
stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm
rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k
rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ
hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF
kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ
924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO
HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r
S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV
8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ
TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw=
=qM44
-----END PGP SIGNATURE-----
Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull copy_file_range updates from Darrick Wong:
"This fixes numerous parameter checking problems and inconsistent
behaviors in the new(ish) copy_file_range system call.
Now the system call will actually check its range parameters
correctly; refuse to copy into files for which the caller does not
have sufficient privileges; update mtime and strip setuid like file
writes are supposed to do; and allows copying up to the EOF of the
source file instead of failing the call like we used to.
Summary:
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where
they are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle
cross-superblock copy_file_range in their local handlers"
* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fuse: copy_file_range needs to strip setuid bits and update timestamps
vfs: allow copy_file_range to copy across devices
xfs: use file_modified() helper
vfs: introduce file_modified() helper
vfs: add missing checks to copy_file_range
vfs: remove redundant checks from generic_remap_checks()
vfs: introduce generic_file_rw_checks()
vfs: no fallback for ->copy_file_range
vfs: introduce generic_copy_file_range()
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAl0jN3oTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFZtjEADOMpZKgmUzMX4CwGd2QjGe6VEvVenV
8rwvFgbgmkkPWD/p3n4bpWBJpwhtSLj2OGn9gsXRM52lmPuzX9XQOBp8n5/Dd7qv
mCAe2yMFWi/imL+neq/xVQLvgi+pBC5dCLhxSX8B+uIokDX7aVWrhnP7csRT5j92
cZWheeMSu7QWw5l8Rne5STwC6jxHhXb2p63zr6tGjlUT/xtum3bb9ZqOIk4b0Vkn
2qTkCZVJpGEIWSNCPvW6oKgAXDQqhtQ2sVIQsfoafe1kSbCHhB6WaUfQHwKqB3Nj
r5R2GFIni877nBqiuZYDUZKyhpkiKIo+cfq2JIQBUBcJBQJ7L7On9wN+NfaWPWXP
pVTLIXO9ClrWc9HUBTpkHSqvd5w2QlkwdXs500Ar1QD6alvxs5WwggirSHKGubpX
8zZsgsrvGZRjb5t/JLCRxPTrXqMvrODKh44JRLDt1Twwizw5SG+Alig7P9SvEVda
7iboRapCJ7ca46AgeIIy2QsUmVjtCg6lFNt3OZsmOJuMSOkANXw9nnQerbprQr7G
g4BfwkKY8IWfXXE3/TOgLHTZhyRgcbN4vuO6Ej+DdaG3NRrMio1h0+AeoXz38CKm
7BB0Aw0NtEC1Bn9tn8SZ9cJ120FCC65EZKYzKnhoR0/XVLtXU/rlcxhID30N7185
j8cy6iZtLoD/Iw==
=e9Bd
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"Just a couple of small lease-related patches this cycle.
One from Ira to add a new tracepoint that fires during lease conflict
checks, and another patch from Amir to reduce false positives when
checking for lease conflicts"
* tag 'locks-v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: eliminate false positive conflicts for write lease
locks: Add trace_leases_conflict
After the update to use nlm_lockowners for the NLM server, there are no
more users of lm_compare_owner and lm_owner_key.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.
Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
check_conflicting_open() is checking for existing fd's open for read or
for write before allowing to take a write lease. The check that was
implemented using i_count and d_count is an approximation that has
several false positives. For example, overlayfs since v4.19, takes an
extra reference on the dentry; An open with O_PATH takes a reference on
the dentry although the file cannot be read nor written.
Change the implementation to use i_readcount and i_writecount to
eliminate the false positive conflicts and allow a write lease to be
taken on an overlayfs file.
The change of behavior with existing fd's open with O_PATH is symmetric
w.r.t. current behavior of lease breakers - an open with O_PATH currently
does not break a write lease.
This increases the size of struct inode by 4 bytes on 32bit archs when
CONFIG_FILE_LOCKING is defined and CONFIG_IMA was not already
defined.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
The combination of file_remove_privs() and file_update_mtime() is
quite common in filesystem ->write_iter() methods.
Modelled after the helper file_accessed(), introduce file_modified()
and use it from generic_remap_file_range_prep().
Note that the order of calling file_remove_privs() before
file_update_mtime() in the helper was matched to the more common order by
filesystems and not the current order in generic_remap_file_range_prep().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.
[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Right now if vfs_copy_file_range() does not use any offload
mechanism, it falls back to calling do_splice_direct(). This fails
to do basic sanity checks on the files being copied. Before we
start adding this necessarily functionality to the fallback path,
separate it out into generic_copy_file_range().
generic_copy_file_range() has the same prototype as
->copy_file_range() so that filesystems can use it in their custom
->copy_file_range() method if they so choose.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A recent documentation conversion renamed this file but forgot
to update the links.
Fixes: af96c1e304 ("docs: filesystems: vfs: Convert vfs.txt to RST")
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.
Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Kill sget_userns(), folding it into sget() as that's the only remaining
user.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount
API.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...
All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
using Unicode 12.1. Also, the usual largish number of cleanups and bug
fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlzSDXYACgkQ8vlZVpUN
gaPQ3Qf/Sh0NqHbmbdW1J52oh4GqUKUhUezEac40yZcZBU4p3PFPZ5Ji83kAQV5r
JgHx5YW4AYHs59UkRVq/er7wKEFJxAE8weUq90WYLE1Z/EjojDE8JHSsK00obKNN
rJOm5qX/gy5C7PVUSWkSuAZQPMSGrmH5U5ie0nrI7bFWnr7T5CQkWarspUq53JBG
RP910mPTT/otE7iTgUzjDeAMKfaSdtRhcJn/uTQ+2YZ1BJsHBHJHDnfQtd3CttHs
ncTUaqPnhWqOKJV2Y9TDyAWYeSbn30cF0dpBM38N4u6YwaUwrBp/kPI0tes97SgY
lZM4VEAW6iF+18uLSyv7D0Mpba9qQg==
=9R7U
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Add as a feature case-insensitive directories (the casefold feature)
using Unicode 12.1.
Also, the usual largish number of cleanups and bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present
ext4: fix ext4_show_options for file systems w/o journal
unicode: refactor the rule for regenerating utf8data.h
docs: ext4.rst: document case-insensitive directories
ext4: Support case-insensitive file name lookups
ext4: include charset encoding information in the superblock
MAINTAINERS: add Unicode subsystem entry
unicode: update unicode database unicode version 12.1.0
unicode: introduce test module for normalized utf8 implementation
unicode: implement higher level API for string handling
unicode: reduce the size of utf8data[]
unicode: introduce code for UTF-8 normalization
unicode: introduce UTF-8 character database
ext4: actually request zeroing of inode table after grow
ext4: cond_resched in work-heavy group loops
ext4: fix use-after-free race with debug_want_extra_isize
ext4: avoid drop reference to iloc.bh twice
ext4: ignore e_value_offs for xattrs with value-in-ea-inode
ext4: protect journal inode's blocks using block_validity
ext4: use BUG() instead of BUG_ON(1)
...
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXNGpJPu3V2unywtrAQIVVQ//ZaEhskofcXYCsyO9pXshVKCZmp1pZ9Q3
ecbTbrF18guwHfM25LURidtjBmEAeuG5NOac/XHxcUbn5NUVzBQ1FircTVmLgtGY
yjrBmMSqIDYhghslLAv78/HibdHJ+Flqy3RWAMyDMecTvx7VGx4idZQl5QIDbNEb
GNvGP3WRiHG8tm6dykfm3afQoAS+n5seBBPDFucqPzAYa/Z/mBLgaZRKbmuMwEAe
Q2mAf7vhYgw55JzeTSZZ4sWGP9Z+9Mi/18Hu8QvJwsrJW+jHlzJHtJp0EphSa3Xs
YIRx+6AQ7WqAhnBUzzY5nBzMClIfMv1GrCG/6rXTTI/UYX65kVAP5M8EW6BAI8oX
Fz2hJqCIvF8ZCSxIYLqizlEkxmEvfmwYxueX9km/+dfTma+MIaajMge+n3fDYmls
S4RONn2LuqVeIw3m8DtKUBr7VRP0J9s1z0O4kubCtZt5PKNekvzSQSMIc17sXSST
Uuo7aL3W6Lxk4bLMmB8o/Rf2RHBZlhmpPk8rF+I6jd0Q45SDV/TttqygyvKZseDo
MZbnmBiDElDWXyKE6gxQqdC13tpb3MlCPv1L+xKDPArXe9yjq2XvHY4NtYBMCa5U
iO1v+6W1JrGh8bkE72YuxKcBVVOStQxhHGU4D8WKZjOI7oeU87U7AD/8kSRhKQni
VRXY1z87sZk=
=yiyv
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"A set of fix and development patches for AFS for 5.2.
Summary:
- Fix the AFS file locking so that sqlite can run on an AFS mount and
also so that firefox and gnome can use a homedir that's mounted
through AFS.
This required emulation of fine-grained locking when the server
will only support whole-file locks and no upgrade/downgrade. Four
modes are provided, settable by mount parameter:
"flock=local" - No reference to the server
"flock=openafs" - Fine-grained locks are local-only, whole-file
locks require sufficient server locks
"flock=strict" - All locks require sufficient server locks
"flock=write" - Always get an exclusive server lock
If the volume is a read-only or backup volume, then flock=local for
that volume.
- Log extra information for a couple of cases where the client mucks
up somehow: AFS vnode with undefined type and dir check failure -
in both cases we seem to end up with unfilled data, but the issues
happen infrequently and are difficult to reproduce at will.
- Implement silly rename for unlink() and rename().
- Set i_blocks so that du can get some information about usage.
- Fix xattr handlers to return the right amount of data and to not
overflow buffers.
- Implement getting/setting raw AFS and YFS ACLs as xattrs"
* tag 'afs-next-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Implement YFS ACL setting
afs: Get YFS ACLs and information through xattrs
afs: implement acl setting
afs: Get an AFS3 ACL as an xattr
afs: Fix getting the afs.fid xattr
afs: Fix the afs.cell and afs.volume xattr handlers
afs: Calculate i_blocks based on file size
afs: Log more information for "kAFS: AFS vnode with undefined type\n"
afs: Provide mount-time configurable byte-range file locking emulation
afs: Add more tracepoints
afs: Implement sillyrename for unlink and rename
afs: Add directory reload tracepoint
afs: Handle lock rpc ops failing on a file that got deleted
afs: Improve dir check failure reports
afs: Add file locking tracepoints
afs: Further fix file locking
afs: Fix AFS file locking to allow fine grained locks
afs: Calculate lock extend timer from set/extend reply reception
afs: Split wait from afs_make_call()
Pull vfs 'struct file' related updates from Al Viro:
"A bit more of 'this fget() would be better off as fdget()'
whack-a-mole + a couple of ->f_count-related cleanups"
* 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
media: switch to fdget()
drm_syncobj: switch to fdget()
amdgpu: switch to fdget()
don't open-code file_count()
fs: drop unused fput_atomic definition