-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGBFjQACgkQ+7dXa6fL
C2vuFhAAiJcYj/qiY9P6OMt593ppDlmrHCt7JzXVtnrVdX5RX0ZgvaG93w3DhPKx
/edBe/Jz+0FEGk3T+6fLyG9NZRmn8xGW9i+fhA88Vy6x1b1+pv2fG7lr/H5F6fg2
JC8dwa4nZdpkDzlfZImwseRJYyA5io/wjMva1BzJXXBXX7Y5chCQmFChrkyRzrX0
Gbj1SMGamDfbu/5OmxK04HDCPaQJ9irWxh/L9/cUcFXYeTYacDF/GwoNTV1pji3Y
Mg6qAiYd/P8c4zHG4ZjhtW1+d2wzuyu+KKKo32kPVM/Z1QWepJ48PcsVmnt67l0R
l18lzOu7i/S0vyL9sS+CXIgKcApyofYsm8X/M5zDC2NQDXa+wNSS/gEsf+waApxr
UtiDEjxDTm0uccHB1VhR4QMx7DrnYp0FBFMp1jUlIu2mMTdAaillzyAxcnGZUwTW
OMTZZxaLpKmvYUpZ9liQLU0Rcv2eQ2y+eI6RifQ4UpOEosSru5CcyxSdfKB/veHk
AkmqqqGRg+obSJoZxBMeTZD6SF0u/BrfBJAsS4d2UaNGkNlPXNi8GUUHsOurSPrh
Z9JqLdCf8u5/hpKQwNPStbjIS6M0KYvkdNuXnLrxbNuk2rgMg8/AraIV0kTTASSt
WV3L+WhNF+iJfVzQEfPQ8UJ7VQfgofdHhF5X1IdiznTe32DxojU=
=ax6b
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
- Split the readpage handler for symlinks from the one for files. The
symlink readpage isn't given a file pointer, so the handling has to
be special-cased.
This has been posted as part of a patchset to foliate netfs, afs,
etc.[1] but I've moved it to this one as it's not actually doing
foliation but is more of a pre-cleanup.
- Fix file creation to set the mtime from the client's clock to keep
make happy if the server's clock isn't quite in sync.[2]
Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html [2]
* tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Set mtime from the client for yfs create operations
afs: Sort out symlink reading
For operations that create vnodes on the server such as CreateFile,
MakeDir or Symlink, the server will store its own current time as
the mtime if the client doesn't pass in a time in the accompanying
StoreStatus structure.
If the server and client clocks are not well synchronized, the client
may see timestamps in the future or inconsistent dependency checks
with "make" for files that are not modified after creation:
make[2]: Warning: File 'arch/x86/kernel/apic/modules.order' has
modification time 0.14 s in the future
make[2]: warning: Clock skew detected. Your build may be incomplete.
This is already handled correctly for non yfs operations; also
set the mtime for the corresponding yfs operations.
Changes:
v3: Replace S_IRWXUGO with 0777, per checkpatch
v2: [dhowells] Merge the two xdr_encode_YFSStoreStatus*() functions together
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html
Add memory folios, a new type to represent either order-0 pages or
the head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to folios.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
=7SBe
-----END PGP SIGNATURE-----
Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmFfE4oACgkQ+7dXa6fL
C2txTBAAnWlEssljz7x09A/I9Js155U2hW9oDSoqkUxqZSe05oBbTPNycURvXAGZ
wZhNZdD5Xc4ITjLmPQQclgkfWc+deq6UKzw8E58XmjiO1Uq6WcqUsC95M1USAmaM
nRyhGrYRxJbv5eRDx3Ox3yoLntlSzvX1ZLhWr6DgAnb9uCdIWSGgy34XTd3aOSZa
OEtPR/tvBZygxMV9wsflD2GNNLe7QDrOMUnvFSlmxBOUolclbHj9uhB/fQXN7frN
Q/nf5QluBqZK13CIbiKSPy0wfl/hEdSFsOs5jAgMGm4IsZjSpsw2lvzxlfEaI7U/
QzNHpqAc0ynPI9fbvs2LTkNFR1oe+njOIVvu0QMjOXEdnyOGEbFjX5eDNiKSAih4
R3cNh2T16yUsx99lVbGkJAwbBQTmdp2yvfugQVX5qDNi+Ln8TFUKUHgruUv/FYJw
hUjcOL6cjGdWORpWkxSoEariA6zDjKCWiyMu5w2yzSufI+DJ0AI6MQVOeqaX6dm6
EldlxDO3w7uvXmwpH1RZsHXCqWfyiHn4P5LsSuVy/wM2O/VemaGQuHsxnLtMMJ+q
HGniSziE6LAvF0RvBrngFGhAY6rqMIGzXK/+S1Z/YwM9+tYnoYhbANDhjmywrcI5
GWaKePV5giTXlaI/XertjzEpQ2yo8r2HkYoVowV3NaRNrc3qgnQ=
=X7mM
-----END PGP SIGNATURE-----
Merge tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfslib, cachefiles and afs fixes from David Howells:
- Fix another couple of oopses in cachefiles tracing stemming from the
possibility of passing in a NULL object pointer
- Fix netfs_clear_unread() to set READ on the iov_iter so that source
it is passed to doesn't do the wrong thing (some drivers look at the
flag on iov_iter rather than other available information to determine
the direction)
- Fix afs_launder_page() to write back at the correct file position on
the server so as not to corrupt data
* tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix afs_launder_page() to set correct start file position
netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()
cachefiles: Fix oops with cachefiles_cull() due to NULL object
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().
Turning PageWriteback() into folio_test_writeback() eliminates a call
to compound_head() which saves 8 bytes and 15 bytes in the two
functions. Unfortunately, that is more than offset by adding the
wait_on_page_writeback compatibility wrapper for a net increase in text
of 7 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
When an afs file or directory is modified locally such that the total file
size is extended, i_blocks needs to be recalculated too.
Fix this by making afs_write_end() and afs_edit_dir_add() call
afs_set_i_size() rather than setting inode->i_size directly as that also
recalculates inode->i_blocks.
This can be tested by creating and writing into directories and files and
then examining them with du. Without this change, directories show a 4
blocks (they start out at 2048 bytes) and files show 0 blocks; with this
change, they should show a number of blocks proportional to the file size
rounded up to 1024.
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/
AFS-3 has two data fetch RPC variants, FS.FetchData and FS.FetchData64, and
Linux's afs client switches between them when talking to a non-YFS server
if the read size, the file position or the sum of the two have the upper 32
bits set of the 64-bit value.
This is a problem, however, since the file position and length fields of
FS.FetchData are *signed* 32-bit values.
Fix this by capturing the capability bits obtained from the fileserver when
it's sent an FS.GetCapabilities RPC, rather than just discarding them, and
then picking out the VICED_CAPABILITY_64BITFILES flag. This can then be
used to decide whether to use FS.FetchData or FS.FetchData64 - and also
FS.StoreData or FS.StoreData64 - rather than using upper_32_bits() to
switch on the parameter values.
This capabilities flag could also be used to limit the maximum size of the
file, but all servers must be checked for that.
Note that the issue does not exist with FS.StoreData - that uses *unsigned*
32-bit values. It's also not a problem with Auristor servers as its
YFS.FetchData64 op uses unsigned 64-bit values.
This can be tested by cloning a git repo through an OpenAFS client to an
OpenAFS server and then doing "git status" on it from a Linux afs
client[1]. Provided the clone has a pack file that's in the 2G-4G range,
the git status will show errors like:
error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
This can be observed in the server's FileLog with something like the
following appearing:
Sun Aug 29 19:31:39 2021 SRXAFS_FetchData, Fid = 2303380852.491776.3263114, Host 192.168.11.201:7001, Id 1001
Sun Aug 29 19:31:39 2021 CheckRights: len=0, for host=192.168.11.201:7001
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: Pos 18446744071815340032, Len 3154
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: file size 2400758866
...
Sun Aug 29 19:31:40 2021 SRXAFS_FetchData returns 5
Note the file position of 18446744071815340032. This is the requested file
position sign-extended.
Fixes: b9b1f8d593 ("AFS: write support fixes")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
cc: openafs-devel@openafs.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c9 [1]
Link: https://lore.kernel.org/r/951332.1631308745@warthog.procyon.org.uk/
Try to avoid taking the RCU read lock when checking the validity of a
vnode's callback state. The only thing it's needed for is to pin the
parent volume's server list whilst we search it to find the record of the
server we're currently using to see if it has been reinitialised (ie. it
sent us a CB.InitCallBackState* RPC).
Do this by the following means:
(1) Keep an additional per-cell counter (fs_s_break) that's incremented
each time any of the fileservers in the cell reinitialises.
Since the new counter can be accessed without RCU from the vnode, we
can check that first - and only if it differs, get the RCU read lock
and check the volume's server list.
(2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
indicates whether the callback promise is still expected to be present
on the server. This does the checks as described in (1).
(3) Restructure afs_check_validity() to take account of the change in (2).
We can also get rid of the valid variable and just use the need_clear
variable with the addition of the afs_cb_break_no_promise reason.
(4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
Move the change to vnode->cb_v_break to __afs_break_callback().
Delegate the change to vnode->cb_s_break to afs_select_fileserver()
and set vnode->cb_fs_s_break there also.
(5) afs_validate() no longer needs to get the RCU read lock around its
call to afs_check_validity() - and can skip the call entirely if we
don't have a promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
Fix the coherency management of mmap'd data such that 3rd-party changes
become visible as soon as possible after the callback notification is
delivered by the fileserver. This is done by the following means:
(1) When we break a callback on a vnode specified by the CB.CallBack call
from the server, we queue a work item (vnode->cb_work) to go and
clobber all the PTEs mapping to that inode.
This causes the CPU to trip through the ->map_pages() and
->page_mkwrite() handlers if userspace attempts to access the page(s)
again.
(Ideally, this would be done in the service handler for CB.CallBack,
but the server is waiting for our reply before considering, and we
have a list of vnodes, all of which need breaking - and the process of
getting the mmap_lock and stripping the PTEs on all CPUs could be
quite slow.)
(2) Call afs_validate() from the ->map_pages() handler to check to see if
the file has changed and to get a new callback promise from the
server.
Also handle the fileserver telling us that it's dropping all callbacks,
possibly after it's been restarted by sending us a CB.InitCallBackState*
call by the following means:
(3) Maintain a per-cell list of afs files that are currently mmap'd
(cell->fs_open_mmaps).
(4) Add a work item to each server that is invoked if there are any open
mmaps when CB.InitCallBackState happens. This work item goes through
the aforementioned list and invokes the vnode->cb_work work item for
each one that is currently using this server.
This causes the PTEs to be cleared, causing ->map_pages() or
->page_mkwrite() to be called again, thereby calling afs_validate()
again.
I've chosen to simply strip the PTEs at the point of notification reception
rather than invalidate all the pages as well because (a) it's faster, (b)
we may get a notification for other reasons than the data being altered (in
which case we don't want to clobber the pagecache) and (c) we need to ask
the server to find out - and I don't want to wait for the reply before
holding up userspace.
This was tested using the attached test program:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
int main(int argc, char *argv[])
{
size_t size = getpagesize();
unsigned char *p;
bool mod = (argc == 3);
int fd;
if (argc != 2 && argc != 3) {
fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
exit(2);
}
fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
if (fd < 0) {
perror(argv[1]);
exit(1);
}
p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (;;) {
if (mod) {
p[0]++;
msync(p, size, MS_ASYNC);
fsync(fd);
}
printf("%02x", p[0]);
fflush(stdout);
sleep(1);
}
}
It runs in two modes: in one mode, it mmaps a file, then sits in a loop
reading the first byte, printing it and sleeping for a second; in the
second mode it mmaps a file, then sits in a loop incrementing the first
byte and flushing, then printing and sleeping.
Two instances of this program can be run on different machines, one doing
the reading and one doing the writing. The reader should see the changes
made by the writer, but without this patch, they aren't because validity
checking is being done lazily - only on entry to the filesystem.
Testing the InitCallBackState change is more complicated. The server has
to be taken offline, the saved callback state file removed and then the
server restarted whilst the reading-mode program continues to run. The
client machine then has to poke the server to trigger the InitCallBackState
call.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
The AFS filesystem is currently triggering the silly-rename cleanup from
afs_d_revalidate() when it sees that a dentry has been changed by a third
party[1]. It should not be doing this as the cleanup includes deleting the
silly-rename target file on iput.
Fix this by removing the places in the d_revalidate handling that validate
anything other than the directory and the dirent. It probably should not
be looking to validate the target inode of the dentry also.
This includes removing the point in afs_d_revalidate() where the inode that
a dentry used to point to was marked as being deleted (AFS_VNODE_DELETED).
We don't know it got deleted. It could have been renamed or it could have
hard links remaining.
This was reproduced by cloning a git repo onto an afs volume on one
machine, switching to another machine and doing "git status", then
switching back to the first and doing "git status". The second status
would show weird output due to ".git/index" getting deleted by the above
mentioned mechanism.
A simpler way to do it is to do:
machine 1: touch a
machine 2: touch b; mv -f b a
machine 1: stat a
on an afs volume. The bug shows up as the stat failing with ENOENT and the
file server log showing that machine 1 deleted "a".
Fixes: 79ddbfa500 ("afs: Implement sillyrename for unlink and rename")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c4 [1]
Link: https://lore.kernel.org/r/163111668100.283156.3851669884664475428.stgit@warthog.procyon.org.uk/
afs_d_revalidate() should only be validating the directory entry it is
given and the directory to which that belongs; it shouldn't be validating
the inode/vnode to which that dentry points. Besides, validation need to
be done even if we don't call afs_d_revalidate() - which might be the case
if we're starting from a file descriptor.
In order for afs_d_revalidate() to be fixed, validation points must be
added in some other places. Certain directory operations, such as
afs_unlink(), already check this, but not all and not all file operations
either.
Note that the validation of a vnode not only checks to see if the
attributes we have are correct, but also gets a promise from the server to
notify us if that file gets changed by a third party.
Add the following checks:
- Check the vnode we're going to make a hard link to.
- Check the vnode we're going to move/rename.
- Check the vnode we're going to read from.
- Check the vnode we're going to write to.
- Check the vnode we're going to sync.
- Check the vnode we're going to make a mapped page writable for.
Some of these aren't strictly necessary as we're going to perform a server
operation that might get the attributes anyway from which we can determine
if something changed - though it might not get us a callback promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111667354.283156.12720698333342917516.stgit@warthog.procyon.org.uk/
There's a loop in afs_extend_writeback() that adds extra pages to a write
we want to make to improve the efficiency of the writeback by making it
larger. This loop stops, however, if we hit a page we can't write back
from immediately, but it doesn't get rid of the page ref we speculatively
acquired.
This was caused by the removal of the cleanup loop when the code switched
from using find_get_pages_contig() to xarray scanning as the latter only
gets a single page at a time, not a batch.
Fix this by putting the page on a ref on an early break from the loop.
Unfortunately, we can't just add that page to the pagevec we're employing
as we'll go through that and add those pages to the RPC call.
This was found by the generic/074 test. It leaks ~4GiB of RAM each time it
is run - which can be observed with "top".
Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111666635.283156.177701903478910460.stgit@warthog.procyon.org.uk/
The afs_read objects created by afs_req_issue_op() get leaked because
afs_alloc_read() returns a ref and then afs_fetch_data() gets its own ref
which is released when the operation completes, but the initial ref is
never released.
Fix this by discarding the initial ref at the end of afs_req_issue_op().
This leak also covered another bug whereby a ref isn't got on the key
attached to the read record by afs_req_issue_op(). This isn't a problem as
long as the afs_read req never goes away...
Fix this by calling key_get() in afs_req_issue_op().
This was found by the generic/074 test. It leaks a bunch of kmalloc-192
objects each time it is run, which can be observed by watching
/proc/slabinfo.
Fixes: f7605fa869cf ("afs: Fix leak of afs_read objects")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163010394740.3035676.8516846193899793357.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163111665914.283156.3038561975681836591.stgit@warthog.procyon.org.uk/
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.
I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.
This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Variable ret is set to -ENOENT and -ENOMEM but this value is never
read as it is overwritten or not used later on, hence it is a
redundant assignment and can be removed.
Cleans up the following clang-analyzer warning:
fs/afs/dir.c:2014:4: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
fs/afs/dir.c:659:2: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
[DH made the following modifications:
- In afs_rename(), -ENOMEM should be placed in op->error instead of ret,
rather than the assignment being removed entirely. afs_put_operation()
will pick it up from there and return it.
- If afs_sillyrename() fails, its error code should be placed in op->error
rather than in ret also.
]
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/1619691492-83866-1-git-send-email-jiapeng.chong@linux.alibaba.com
Link: https://lore.kernel.org/r/162609465444.3133237.7562832521724298900.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162610729052.3408253.17364333638838151299.stgit@warthog.procyon.org.uk/ # v2
To quote Alexey[1]:
I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.
Then did
perf record -a -e ...
It crashed:
general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0
Then reproducer was narrowed to
# cat /sys/kernel/tracing/printk_formats
Original F34 kernel with modules didn't crash.
So I started to disable options and after disabling AFS everything
started working again.
The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.
Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.
Steps to reproduce:
CONFIG_AFS=y
CONFIG_TRACING=y
# cat /sys/kernel/tracing/printk_formats
Fix this by the following means:
(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.
(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.
(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?
Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.
(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.
Fixes: 8e8d7f13b6 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmDQ9Z4ACgkQ+7dXa6fL
C2sVfg/9H3OlL4Vwv5CERgb5kbDoALAh5epYMFyvNVy9l5iTvYekxG3qna6ee0+G
pTRHSU5tZUCUslpafv8LUz1RE/iM127y+BP+qq2joG9jkT4q3QfeV4oDr+dgZWCL
obp+rjQSTKkaGh3eqjDx7gCSp5mqQI/M8MXe1VOXxUzAnsf2nH4iJLv/A9NN17xW
l7sQfZJGHD1BPqsxxFiSr+UCkCLuLDybUDYL6+PZcFu0jSON0h4yEtIwsAA7a1Zw
TvgUpequrbyTORI3sHJ8eIixWosh4yLTJ4pDqs8qDqq3Wm5eTZjss0wxEaVfpaTu
cg/CcNUBiHJ/6Q8r+JiuPbEnfnQ9woL8951/CNi+cOGhTy9LNoG70orSZKKynmW5
QmpYBK5BkM57EPj7DYZJRI1Bwy41pJapFj8tXbMjObU+ZyaXMysmBDanIRK/cLoy
fr4Sz+1D8yJPQ0GDgC4051CxrhynOEnRo8JbESg8CD4PnqFeM7EoCh48H2oVvR4N
9v2xuaRyBvi2KTmKSktRe+s1DS80acVMUYP33nT2zAthvL91VVgMY3Hz2/QrlAp8
h1hREME8aRcN4LrBIgp/ET150hUh44U2A07PqnYYe0653MH7aHHFfk134ZuTmbub
Pfc1MHtWgPAWN4c140ILBRTidJeShszoJGgpD6tflkj10VI2s34=
=2c6l
-----END PGP SIGNATURE-----
Merge tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
"This contains patches to fix netfs_write_begin() and afs_write_end()
in the following ways:
(1) In netfs_write_begin(), extract the decision about whether to skip
a page out to its own helper and have that clear around the region
to be written, but not clear that region. This requires the
filesystem to patch it up afterwards if the hole doesn't get
completely filled.
(2) Use offset_in_thp() in (1) rather than manually calculating the
offset into the page.
(3) Due to (1), afs_write_end() now needs to handle short data write
into the page by generic_perform_write(). I've adopted an
analogous approach to ceph of just returning 0 in this case and
letting the caller go round again.
It also adds a note that (in the future) the len parameter may extend
beyond the page allocated. This is because the page allocation is
deferred to write_begin() and that gets to decide what size of THP to
allocate."
Jeff Layton points out:
"The netfs fix in particular fixes a data corruption bug in cephfs"
* tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: fix test for whether we can skip read when writing beyond EOF
afs: Fix afs_write_end() to handle short writes
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page. Two things are necessary:
(1) If the page is not up to date, then we should just return 0
(ie. indicating a zero-length copy). The loop in
generic_perform_write() will go around again, possibly breaking up the
iterator into discrete chunks[1].
This is analogous to commit b9de313cf0
for ceph.
(2) The page should not have been set uptodate if it wasn't completely set
up by netfs_write_begin() (this will be fixed in the next patch), so
we need to set uptodate here in such a case.
Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above. The
assertion was from when uptodate was set elsewhere.
Changes:
v3: Remove the handling of len exceeding the end of the page.
Fixes: 3003bbd069 ("afs: Use the netfs_write_begin() helper")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1]
Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter. This may be reported by lockdep like this:
====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
#0: ffff888104eac530
(
sb_pagefaults
as filesystem freezing is modelled as a lock.
Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.
Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e87b03f583 ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes. The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE. This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.
This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data. The rest is eventually written
back by background threads after dirty_expire_centisecs.
Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix rename of one directory over another such that the nlink on the deleted
directory is cleared to 0 rather than being decremented to 1.
This was causing the generic/035 xfstest to fail.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:
kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)
This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).
What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).
On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).
On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.
This means that:
- a status fetch can access the target vnode either side of the exclusive
section of the modification
- the status fetch could start before the modification, yet finish after,
and vice-versa.
- the status fetch and the modification RPCs can complete in either order.
- the status fetch can return either the before or the after DV from the
modification.
- the status fetch might regress the locally cached DV.
Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.
Fix this by the following means:
(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.
(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.
Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.
Fixes: a9e5c87ca7 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHJJAACgkQ+7dXa6fL
C2uv0A//S/sJyToPtj3xbzmRVmSGGWFYNRMaxBD2gYAq7swbDNiX4ZbBCe8A4FBY
zedeMfoNztHIRB2M9vvnhG4HJWXPKq2BaT0xzeteCcmZ65b5zBOrAXue0PQPqE20
xmK1RDls/y5Y2FaF92Ay0VZzXW7+y/M+RRSo+FCFzrIgpJrPprTnlZigrECYauGJ
Qdsv26rQ0flK6tyi6GVuWZIMvpINCt3WwpwQTkAUewz2VewA1tZ1xFe70sP0vF7R
MJNaS6A4uJmvoJJzb8rqdnBGiu76+TxmPaXn0IZKJBECZjBVJyk/duce0jgqbQ7C
PZz5j4C2xrPyu3Y98joj37HPEAHCy0DPRx2Es1mz5cHPzI1TDRClHzPrxyycz9gr
D9WnMiPj9ff9aDaV6XpWKyuHhPxaHpoOD3VGdrhx6bU19Jd3/mLHB3lSt1kJzWdg
QrSAk3KzMWAZigz/+I5xetOpbygKTPLEYgpdmdOSTrtACcm1wjnhIougu0FUIWXK
arPNFOIV9liN0qCQyDOcLx4UEcxXrb2W0AYeHHJDBFxJ7sT2WWUCjPZFW5bh3G+Y
goKv/XJRVWJxFlTXLZLZ5siclzzIlAAmSylh661ji836yRhqTQ3NJTB8QfnrGGsZ
QlD1hjpyqC8uwIGUvoh56KdLRTxj9Gj70gpVe/Lk3Z16mivqDUE=
=fSr0
-----END PGP SIGNATURE-----
Merge tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"Use the new netfs lib.
Begin the process of overhauling the use of the fscache API by AFS and
the introduction of support for features such as Transparent Huge
Pages (THPs).
- Add some support for THPs, including using core VM helper functions
to find details of pages.
- Use the ITER_XARRAY I/O iterator to mediate access to the pagecache
as this handles THPs and doesn't require allocation of large bvec
arrays.
- Delegate address_space read/pre-write I/O methods for AFS to the
netfs helper library. A method is provided to the library that
allows it to issue a read against the server.
This includes a change in use for PG_fscache (it now indicates a
DIO write in progress from the marked page), so a number of waits
need to be deployed for it.
- Split the core AFS writeback function to make it easier to modify
in future patches to handle writing to the cache. [This might
feasibly make more sense moved out into my fscache-iter branch].
I've tested these with "xfstests -g quick" against an AFS volume
(xfstests needs patching to make it work). With this, AFS without a
cache passes all expected xfstests; with a cache, there's an extra
failure, but that's also there before these patches. Fixing that
probably requires a greater overhaul (as can be found on my
fscache-iter branch, but that's for a later time).
Thanks should go to Marc Dionne and Jeff Altman of AuriStor for
exercising the patches in their test farm also"
Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/
* tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Use the netfs_write_begin() helper
afs: Use new netfs lib read helper API
afs: Use the fs operation ops to handle FetchData completion
afs: Prepare for use of THPs
afs: Extract writeback extension into its own function
afs: Wait on PG_fscache before modifying/releasing a page
afs: Use ITER_XARRAY for writing
afs: Set up the iov_iter before calling afs_extract_data()
afs: Log remote unmarshalling errors
afs: Don't truncate iter during data fetch
afs: Move key to afs_read struct
afs: Print the operation debug_id when logging an unexpected data version
afs: Pass page into dirty region helpers to provide THP size
afs: Disable use of the fscache I/O routines
Pull vfs inode type handling updates from Al Viro:
"We should never change the type bits of ->i_mode or the method tables
(->i_op and ->i_fop) of a live inode.
Unfortunately, not all filesystems took care to prevent that"
* 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
spufs: fix bogosity in S_ISGID handling
9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
openpromfs: don't do unlock_new_inode() until the new inode is set up
hostfs_mknod(): don't bother with init_special_inode()
cifs: have cifs_fattr_to_inode() refuse to change type on live inode
cifs: have ->mkdir() handle race with another client sanely
do_cifs_create(): don't set ->i_mode of something we had not created
gfs2: be careful with inode refresh
ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode
orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...
vboxsf: don't allow to change the inode type
afs: Fix updating of i_mode due to 3rd party change
ceph: don't allow type or device number to change on non-I_NEW inodes
ceph: fix up error handling with snapdirs
new helper: inode_wrong_type()
afs_listxattr() lists all the available special afs xattrs (i.e. those in
the "afs.*" space), no matter what type of server we're dealing with. But
OpenAFS servers, for example, cannot deal with some of the extra-capable
attributes that AuriStor (YFS) servers provide. Unfortunately, the
presence of the afs.yfs.* attributes causes errors[1] for anything that
tries to read them if the server is of the wrong type.
Fix the problem by removing afs_listxattr() so that none of the special
xattrs are listed (AFS doesn't support xattrs). It does mean, however,
that getfattr won't list them, though they can still be accessed with
getxattr() and setxattr().
This can be tested with something like:
getfattr -d -m ".*" /afs/example.com/path/to/file
With this change, none of the afs.* attributes should be visible.
Changes:
ver #2:
- Hide all of the afs.* xattrs, not just the ACL ones.
Fixes: ae46578b96 ("afs: Get YFS ACLs and information through xattrs")
Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003502.html [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003567.html # v1
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003573.html # v2
If someone attempts to access YFS-related xattrs (e.g. afs.yfs.acl) on a
file on a non-YFS AFS server (such as OpenAFS), then the kernel will jump
to a NULL function pointer because the afs_fetch_acl_operation descriptor
doesn't point to a function for issuing an operation on a non-YFS
server[1].
Fix this by making afs_wait_for_operation() check that the issue_afs_rpc
method is set before jumping to it and setting -ENOTSUPP if not. This fix
also covers other potential operations that also only exist on YFS servers.
afs_xattr_get/set_yfs() then need to translate -ENOTSUPP to -ENODATA as the
former error is internal to the kernel.
The bug shows up as an oops like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[...]
Call Trace:
afs_wait_for_operation+0x83/0x1b0 [kafs]
afs_xattr_get_yfs+0xe6/0x270 [kafs]
__vfs_getxattr+0x59/0x80
vfs_getxattr+0x11c/0x140
getxattr+0x181/0x250
? __check_object_size+0x13f/0x150
? __fput+0x16d/0x250
__x64_sys_fgetxattr+0x64/0xb0
do_syscall_64+0x49/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb120a9defe
This was triggered with "cp -a" which attempts to copy xattrs, including
afs ones, but is easier to reproduce with getfattr, e.g.:
getfattr -d -m ".*" /afs/openafs.org/
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003498.html [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003566.html # v1
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003572.html # v2
Fix afs_apply_status() to mask off the irrelevant bits from status->mode
when OR'ing them into i_mode. This can happen when a 3rd party chmod
occurs.
Also fix afs_inode_init_from_status() to mask off the mode bits when
initialising i_mode.
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...