Change the afs filesystem to support the new afs driver.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.
(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.
For afs, I've made it render the volume name string as:
"afs,<cell>,<volume_id>"
and the coherency data is currently 0.
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.
(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.
fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).
fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.
(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.
(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.
(7) fscache_note_page_release() is called from afs_release_page().
(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.
Render the parts of the cookie key for an afs inode cookie as big endian.
Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGBFjQACgkQ+7dXa6fL
C2vuFhAAiJcYj/qiY9P6OMt593ppDlmrHCt7JzXVtnrVdX5RX0ZgvaG93w3DhPKx
/edBe/Jz+0FEGk3T+6fLyG9NZRmn8xGW9i+fhA88Vy6x1b1+pv2fG7lr/H5F6fg2
JC8dwa4nZdpkDzlfZImwseRJYyA5io/wjMva1BzJXXBXX7Y5chCQmFChrkyRzrX0
Gbj1SMGamDfbu/5OmxK04HDCPaQJ9irWxh/L9/cUcFXYeTYacDF/GwoNTV1pji3Y
Mg6qAiYd/P8c4zHG4ZjhtW1+d2wzuyu+KKKo32kPVM/Z1QWepJ48PcsVmnt67l0R
l18lzOu7i/S0vyL9sS+CXIgKcApyofYsm8X/M5zDC2NQDXa+wNSS/gEsf+waApxr
UtiDEjxDTm0uccHB1VhR4QMx7DrnYp0FBFMp1jUlIu2mMTdAaillzyAxcnGZUwTW
OMTZZxaLpKmvYUpZ9liQLU0Rcv2eQ2y+eI6RifQ4UpOEosSru5CcyxSdfKB/veHk
AkmqqqGRg+obSJoZxBMeTZD6SF0u/BrfBJAsS4d2UaNGkNlPXNi8GUUHsOurSPrh
Z9JqLdCf8u5/hpKQwNPStbjIS6M0KYvkdNuXnLrxbNuk2rgMg8/AraIV0kTTASSt
WV3L+WhNF+iJfVzQEfPQ8UJ7VQfgofdHhF5X1IdiznTe32DxojU=
=ax6b
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
- Split the readpage handler for symlinks from the one for files. The
symlink readpage isn't given a file pointer, so the handling has to
be special-cased.
This has been posted as part of a patchset to foliate netfs, afs,
etc.[1] but I've moved it to this one as it's not actually doing
foliation but is more of a pre-cleanup.
- Fix file creation to set the mtime from the client's clock to keep
make happy if the server's clock isn't quite in sync.[2]
Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html [2]
* tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Set mtime from the client for yfs create operations
afs: Sort out symlink reading
For operations that create vnodes on the server such as CreateFile,
MakeDir or Symlink, the server will store its own current time as
the mtime if the client doesn't pass in a time in the accompanying
StoreStatus structure.
If the server and client clocks are not well synchronized, the client
may see timestamps in the future or inconsistent dependency checks
with "make" for files that are not modified after creation:
make[2]: Warning: File 'arch/x86/kernel/apic/modules.order' has
modification time 0.14 s in the future
make[2]: warning: Clock skew detected. Your build may be incomplete.
This is already handled correctly for non yfs operations; also
set the mtime for the corresponding yfs operations.
Changes:
v3: Replace S_IRWXUGO with 0777, per checkpatch
v2: [dhowells] Merge the two xdr_encode_YFSStoreStatus*() functions together
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html
Add memory folios, a new type to represent either order-0 pages or
the head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to folios.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
=7SBe
-----END PGP SIGNATURE-----
Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmFfE4oACgkQ+7dXa6fL
C2txTBAAnWlEssljz7x09A/I9Js155U2hW9oDSoqkUxqZSe05oBbTPNycURvXAGZ
wZhNZdD5Xc4ITjLmPQQclgkfWc+deq6UKzw8E58XmjiO1Uq6WcqUsC95M1USAmaM
nRyhGrYRxJbv5eRDx3Ox3yoLntlSzvX1ZLhWr6DgAnb9uCdIWSGgy34XTd3aOSZa
OEtPR/tvBZygxMV9wsflD2GNNLe7QDrOMUnvFSlmxBOUolclbHj9uhB/fQXN7frN
Q/nf5QluBqZK13CIbiKSPy0wfl/hEdSFsOs5jAgMGm4IsZjSpsw2lvzxlfEaI7U/
QzNHpqAc0ynPI9fbvs2LTkNFR1oe+njOIVvu0QMjOXEdnyOGEbFjX5eDNiKSAih4
R3cNh2T16yUsx99lVbGkJAwbBQTmdp2yvfugQVX5qDNi+Ln8TFUKUHgruUv/FYJw
hUjcOL6cjGdWORpWkxSoEariA6zDjKCWiyMu5w2yzSufI+DJ0AI6MQVOeqaX6dm6
EldlxDO3w7uvXmwpH1RZsHXCqWfyiHn4P5LsSuVy/wM2O/VemaGQuHsxnLtMMJ+q
HGniSziE6LAvF0RvBrngFGhAY6rqMIGzXK/+S1Z/YwM9+tYnoYhbANDhjmywrcI5
GWaKePV5giTXlaI/XertjzEpQ2yo8r2HkYoVowV3NaRNrc3qgnQ=
=X7mM
-----END PGP SIGNATURE-----
Merge tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfslib, cachefiles and afs fixes from David Howells:
- Fix another couple of oopses in cachefiles tracing stemming from the
possibility of passing in a NULL object pointer
- Fix netfs_clear_unread() to set READ on the iov_iter so that source
it is passed to doesn't do the wrong thing (some drivers look at the
flag on iov_iter rather than other available information to determine
the direction)
- Fix afs_launder_page() to write back at the correct file position on
the server so as not to corrupt data
* tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix afs_launder_page() to set correct start file position
netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()
cachefiles: Fix oops with cachefiles_cull() due to NULL object
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().
Turning PageWriteback() into folio_test_writeback() eliminates a call
to compound_head() which saves 8 bytes and 15 bytes in the two
functions. Unfortunately, that is more than offset by adding the
wait_on_page_writeback compatibility wrapper for a net increase in text
of 7 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
When an afs file or directory is modified locally such that the total file
size is extended, i_blocks needs to be recalculated too.
Fix this by making afs_write_end() and afs_edit_dir_add() call
afs_set_i_size() rather than setting inode->i_size directly as that also
recalculates inode->i_blocks.
This can be tested by creating and writing into directories and files and
then examining them with du. Without this change, directories show a 4
blocks (they start out at 2048 bytes) and files show 0 blocks; with this
change, they should show a number of blocks proportional to the file size
rounded up to 1024.
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/
AFS-3 has two data fetch RPC variants, FS.FetchData and FS.FetchData64, and
Linux's afs client switches between them when talking to a non-YFS server
if the read size, the file position or the sum of the two have the upper 32
bits set of the 64-bit value.
This is a problem, however, since the file position and length fields of
FS.FetchData are *signed* 32-bit values.
Fix this by capturing the capability bits obtained from the fileserver when
it's sent an FS.GetCapabilities RPC, rather than just discarding them, and
then picking out the VICED_CAPABILITY_64BITFILES flag. This can then be
used to decide whether to use FS.FetchData or FS.FetchData64 - and also
FS.StoreData or FS.StoreData64 - rather than using upper_32_bits() to
switch on the parameter values.
This capabilities flag could also be used to limit the maximum size of the
file, but all servers must be checked for that.
Note that the issue does not exist with FS.StoreData - that uses *unsigned*
32-bit values. It's also not a problem with Auristor servers as its
YFS.FetchData64 op uses unsigned 64-bit values.
This can be tested by cloning a git repo through an OpenAFS client to an
OpenAFS server and then doing "git status" on it from a Linux afs
client[1]. Provided the clone has a pack file that's in the 2G-4G range,
the git status will show errors like:
error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
error: packfile .git/objects/pack/pack-5e813c51d12b6847bbc0fcd97c2bca66da50079c.pack does not match index
This can be observed in the server's FileLog with something like the
following appearing:
Sun Aug 29 19:31:39 2021 SRXAFS_FetchData, Fid = 2303380852.491776.3263114, Host 192.168.11.201:7001, Id 1001
Sun Aug 29 19:31:39 2021 CheckRights: len=0, for host=192.168.11.201:7001
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: Pos 18446744071815340032, Len 3154
Sun Aug 29 19:31:39 2021 FetchData_RXStyle: file size 2400758866
...
Sun Aug 29 19:31:40 2021 SRXAFS_FetchData returns 5
Note the file position of 18446744071815340032. This is the requested file
position sign-extended.
Fixes: b9b1f8d593 ("AFS: write support fixes")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
cc: openafs-devel@openafs.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c9 [1]
Link: https://lore.kernel.org/r/951332.1631308745@warthog.procyon.org.uk/
Try to avoid taking the RCU read lock when checking the validity of a
vnode's callback state. The only thing it's needed for is to pin the
parent volume's server list whilst we search it to find the record of the
server we're currently using to see if it has been reinitialised (ie. it
sent us a CB.InitCallBackState* RPC).
Do this by the following means:
(1) Keep an additional per-cell counter (fs_s_break) that's incremented
each time any of the fileservers in the cell reinitialises.
Since the new counter can be accessed without RCU from the vnode, we
can check that first - and only if it differs, get the RCU read lock
and check the volume's server list.
(2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
indicates whether the callback promise is still expected to be present
on the server. This does the checks as described in (1).
(3) Restructure afs_check_validity() to take account of the change in (2).
We can also get rid of the valid variable and just use the need_clear
variable with the addition of the afs_cb_break_no_promise reason.
(4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
Move the change to vnode->cb_v_break to __afs_break_callback().
Delegate the change to vnode->cb_s_break to afs_select_fileserver()
and set vnode->cb_fs_s_break there also.
(5) afs_validate() no longer needs to get the RCU read lock around its
call to afs_check_validity() - and can skip the call entirely if we
don't have a promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
Fix the coherency management of mmap'd data such that 3rd-party changes
become visible as soon as possible after the callback notification is
delivered by the fileserver. This is done by the following means:
(1) When we break a callback on a vnode specified by the CB.CallBack call
from the server, we queue a work item (vnode->cb_work) to go and
clobber all the PTEs mapping to that inode.
This causes the CPU to trip through the ->map_pages() and
->page_mkwrite() handlers if userspace attempts to access the page(s)
again.
(Ideally, this would be done in the service handler for CB.CallBack,
but the server is waiting for our reply before considering, and we
have a list of vnodes, all of which need breaking - and the process of
getting the mmap_lock and stripping the PTEs on all CPUs could be
quite slow.)
(2) Call afs_validate() from the ->map_pages() handler to check to see if
the file has changed and to get a new callback promise from the
server.
Also handle the fileserver telling us that it's dropping all callbacks,
possibly after it's been restarted by sending us a CB.InitCallBackState*
call by the following means:
(3) Maintain a per-cell list of afs files that are currently mmap'd
(cell->fs_open_mmaps).
(4) Add a work item to each server that is invoked if there are any open
mmaps when CB.InitCallBackState happens. This work item goes through
the aforementioned list and invokes the vnode->cb_work work item for
each one that is currently using this server.
This causes the PTEs to be cleared, causing ->map_pages() or
->page_mkwrite() to be called again, thereby calling afs_validate()
again.
I've chosen to simply strip the PTEs at the point of notification reception
rather than invalidate all the pages as well because (a) it's faster, (b)
we may get a notification for other reasons than the data being altered (in
which case we don't want to clobber the pagecache) and (c) we need to ask
the server to find out - and I don't want to wait for the reply before
holding up userspace.
This was tested using the attached test program:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
int main(int argc, char *argv[])
{
size_t size = getpagesize();
unsigned char *p;
bool mod = (argc == 3);
int fd;
if (argc != 2 && argc != 3) {
fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
exit(2);
}
fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
if (fd < 0) {
perror(argv[1]);
exit(1);
}
p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (;;) {
if (mod) {
p[0]++;
msync(p, size, MS_ASYNC);
fsync(fd);
}
printf("%02x", p[0]);
fflush(stdout);
sleep(1);
}
}
It runs in two modes: in one mode, it mmaps a file, then sits in a loop
reading the first byte, printing it and sleeping for a second; in the
second mode it mmaps a file, then sits in a loop incrementing the first
byte and flushing, then printing and sleeping.
Two instances of this program can be run on different machines, one doing
the reading and one doing the writing. The reader should see the changes
made by the writer, but without this patch, they aren't because validity
checking is being done lazily - only on entry to the filesystem.
Testing the InitCallBackState change is more complicated. The server has
to be taken offline, the saved callback state file removed and then the
server restarted whilst the reading-mode program continues to run. The
client machine then has to poke the server to trigger the InitCallBackState
call.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
The AFS filesystem is currently triggering the silly-rename cleanup from
afs_d_revalidate() when it sees that a dentry has been changed by a third
party[1]. It should not be doing this as the cleanup includes deleting the
silly-rename target file on iput.
Fix this by removing the places in the d_revalidate handling that validate
anything other than the directory and the dirent. It probably should not
be looking to validate the target inode of the dentry also.
This includes removing the point in afs_d_revalidate() where the inode that
a dentry used to point to was marked as being deleted (AFS_VNODE_DELETED).
We don't know it got deleted. It could have been renamed or it could have
hard links remaining.
This was reproduced by cloning a git repo onto an afs volume on one
machine, switching to another machine and doing "git status", then
switching back to the first and doing "git status". The second status
would show weird output due to ".git/index" getting deleted by the above
mentioned mechanism.
A simpler way to do it is to do:
machine 1: touch a
machine 2: touch b; mv -f b a
machine 1: stat a
on an afs volume. The bug shows up as the stat failing with ENOENT and the
file server log showing that machine 1 deleted "a".
Fixes: 79ddbfa500 ("afs: Implement sillyrename for unlink and rename")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217#c4 [1]
Link: https://lore.kernel.org/r/163111668100.283156.3851669884664475428.stgit@warthog.procyon.org.uk/
afs_d_revalidate() should only be validating the directory entry it is
given and the directory to which that belongs; it shouldn't be validating
the inode/vnode to which that dentry points. Besides, validation need to
be done even if we don't call afs_d_revalidate() - which might be the case
if we're starting from a file descriptor.
In order for afs_d_revalidate() to be fixed, validation points must be
added in some other places. Certain directory operations, such as
afs_unlink(), already check this, but not all and not all file operations
either.
Note that the validation of a vnode not only checks to see if the
attributes we have are correct, but also gets a promise from the server to
notify us if that file gets changed by a third party.
Add the following checks:
- Check the vnode we're going to make a hard link to.
- Check the vnode we're going to move/rename.
- Check the vnode we're going to read from.
- Check the vnode we're going to write to.
- Check the vnode we're going to sync.
- Check the vnode we're going to make a mapped page writable for.
Some of these aren't strictly necessary as we're going to perform a server
operation that might get the attributes anyway from which we can determine
if something changed - though it might not get us a callback promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111667354.283156.12720698333342917516.stgit@warthog.procyon.org.uk/
There's a loop in afs_extend_writeback() that adds extra pages to a write
we want to make to improve the efficiency of the writeback by making it
larger. This loop stops, however, if we hit a page we can't write back
from immediately, but it doesn't get rid of the page ref we speculatively
acquired.
This was caused by the removal of the cleanup loop when the code switched
from using find_get_pages_contig() to xarray scanning as the latter only
gets a single page at a time, not a batch.
Fix this by putting the page on a ref on an early break from the loop.
Unfortunately, we can't just add that page to the pagevec we're employing
as we'll go through that and add those pages to the RPC call.
This was found by the generic/074 test. It leaks ~4GiB of RAM each time it
is run - which can be observed with "top".
Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111666635.283156.177701903478910460.stgit@warthog.procyon.org.uk/
The afs_read objects created by afs_req_issue_op() get leaked because
afs_alloc_read() returns a ref and then afs_fetch_data() gets its own ref
which is released when the operation completes, but the initial ref is
never released.
Fix this by discarding the initial ref at the end of afs_req_issue_op().
This leak also covered another bug whereby a ref isn't got on the key
attached to the read record by afs_req_issue_op(). This isn't a problem as
long as the afs_read req never goes away...
Fix this by calling key_get() in afs_req_issue_op().
This was found by the generic/074 test. It leaks a bunch of kmalloc-192
objects each time it is run, which can be observed by watching
/proc/slabinfo.
Fixes: f7605fa869cf ("afs: Fix leak of afs_read objects")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163010394740.3035676.8516846193899793357.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163111665914.283156.3038561975681836591.stgit@warthog.procyon.org.uk/
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.
I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.
This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Variable ret is set to -ENOENT and -ENOMEM but this value is never
read as it is overwritten or not used later on, hence it is a
redundant assignment and can be removed.
Cleans up the following clang-analyzer warning:
fs/afs/dir.c:2014:4: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
fs/afs/dir.c:659:2: warning: Value stored to 'ret' is never read
[clang-analyzer-deadcode.DeadStores].
[DH made the following modifications:
- In afs_rename(), -ENOMEM should be placed in op->error instead of ret,
rather than the assignment being removed entirely. afs_put_operation()
will pick it up from there and return it.
- If afs_sillyrename() fails, its error code should be placed in op->error
rather than in ret also.
]
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/1619691492-83866-1-git-send-email-jiapeng.chong@linux.alibaba.com
Link: https://lore.kernel.org/r/162609465444.3133237.7562832521724298900.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162610729052.3408253.17364333638838151299.stgit@warthog.procyon.org.uk/ # v2
To quote Alexey[1]:
I was adding custom tracepoint to the kernel, grabbed full F34 kernel
.config, disabled modules and booted whole shebang as VM kernel.
Then did
perf record -a -e ...
It crashed:
general protection fault, probably for non-canonical address 0x435f5346592e4243: 0000 [#1] SMP PTI
CPU: 1 PID: 842 Comm: cat Not tainted 5.12.6+ #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:t_show+0x22/0xd0
Then reproducer was narrowed to
# cat /sys/kernel/tracing/printk_formats
Original F34 kernel with modules didn't crash.
So I started to disable options and after disabling AFS everything
started working again.
The root cause is that AFS was placing char arrays content into a
section full of _pointers_ to strings with predictable consequences.
Non canonical address 435f5346592e4243 is "CB.YFS_" which came from
CM_NAME macro.
Steps to reproduce:
CONFIG_AFS=y
CONFIG_TRACING=y
# cat /sys/kernel/tracing/printk_formats
Fix this by the following means:
(1) Add enum->string translation tables in the event header with the AFS
and YFS cache/callback manager operations listed by RPC operation ID.
(2) Modify the afs_cb_call tracepoint to print the string from the
translation table rather than using the string at the afs_call name
pointer.
(3) Switch translation table depending on the service we're being accessed
as (AFS or YFS) in the tracepoint print clause. Will this cause
problems to userspace utilities?
Note that the symbolic representation of the YFS service ID isn't
available to this header, so I've put it in as a number. I'm not sure
if this is the best way to do this.
(4) Remove the name wrangling (CM_NAME) macro and put the names directly
into the afs_call_type structs in cmservice.c.
Fixes: 8e8d7f13b6 ("afs: Add some tracepoints")
Reported-by: Alexey Dobriyan (SK hynix) <adobriyan@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLAXfvZ+rObEOdc%2F@localhost.localdomain/ [1]
Link: https://lore.kernel.org/r/643721.1623754699@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162430903582.2896199.6098150063997983353.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162609463957.3133237.15916579353149746363.stgit@warthog.procyon.org.uk/ # v1 (repost)
Link: https://lore.kernel.org/r/162610726860.3408253.445207609466288531.stgit@warthog.procyon.org.uk/ # v2
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmDQ9Z4ACgkQ+7dXa6fL
C2sVfg/9H3OlL4Vwv5CERgb5kbDoALAh5epYMFyvNVy9l5iTvYekxG3qna6ee0+G
pTRHSU5tZUCUslpafv8LUz1RE/iM127y+BP+qq2joG9jkT4q3QfeV4oDr+dgZWCL
obp+rjQSTKkaGh3eqjDx7gCSp5mqQI/M8MXe1VOXxUzAnsf2nH4iJLv/A9NN17xW
l7sQfZJGHD1BPqsxxFiSr+UCkCLuLDybUDYL6+PZcFu0jSON0h4yEtIwsAA7a1Zw
TvgUpequrbyTORI3sHJ8eIixWosh4yLTJ4pDqs8qDqq3Wm5eTZjss0wxEaVfpaTu
cg/CcNUBiHJ/6Q8r+JiuPbEnfnQ9woL8951/CNi+cOGhTy9LNoG70orSZKKynmW5
QmpYBK5BkM57EPj7DYZJRI1Bwy41pJapFj8tXbMjObU+ZyaXMysmBDanIRK/cLoy
fr4Sz+1D8yJPQ0GDgC4051CxrhynOEnRo8JbESg8CD4PnqFeM7EoCh48H2oVvR4N
9v2xuaRyBvi2KTmKSktRe+s1DS80acVMUYP33nT2zAthvL91VVgMY3Hz2/QrlAp8
h1hREME8aRcN4LrBIgp/ET150hUh44U2A07PqnYYe0653MH7aHHFfk134ZuTmbub
Pfc1MHtWgPAWN4c140ILBRTidJeShszoJGgpD6tflkj10VI2s34=
=2c6l
-----END PGP SIGNATURE-----
Merge tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
"This contains patches to fix netfs_write_begin() and afs_write_end()
in the following ways:
(1) In netfs_write_begin(), extract the decision about whether to skip
a page out to its own helper and have that clear around the region
to be written, but not clear that region. This requires the
filesystem to patch it up afterwards if the hole doesn't get
completely filled.
(2) Use offset_in_thp() in (1) rather than manually calculating the
offset into the page.
(3) Due to (1), afs_write_end() now needs to handle short data write
into the page by generic_perform_write(). I've adopted an
analogous approach to ceph of just returning 0 in this case and
letting the caller go round again.
It also adds a note that (in the future) the len parameter may extend
beyond the page allocated. This is because the page allocation is
deferred to write_begin() and that gets to decide what size of THP to
allocate."
Jeff Layton points out:
"The netfs fix in particular fixes a data corruption bug in cephfs"
* tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: fix test for whether we can skip read when writing beyond EOF
afs: Fix afs_write_end() to handle short writes
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page. Two things are necessary:
(1) If the page is not up to date, then we should just return 0
(ie. indicating a zero-length copy). The loop in
generic_perform_write() will go around again, possibly breaking up the
iterator into discrete chunks[1].
This is analogous to commit b9de313cf0
for ceph.
(2) The page should not have been set uptodate if it wasn't completely set
up by netfs_write_begin() (this will be fixed in the next patch), so
we need to set uptodate here in such a case.
Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above. The
assertion was from when uptodate was set elsewhere.
Changes:
v3: Remove the handling of len exceeding the end of the page.
Fixes: 3003bbd069 ("afs: Use the netfs_write_begin() helper")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1]
Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter. This may be reported by lockdep like this:
====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
#0: ffff888104eac530
(
sb_pagefaults
as filesystem freezing is modelled as a lock.
Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.
Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e87b03f583 ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes. The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE. This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.
This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data. The rest is eventually written
back by background threads after dirty_expire_centisecs.
Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix rename of one directory over another such that the nlink on the deleted
directory is cleared to 0 rather than being decremented to 1.
This was causing the generic/035 xfstest to fail.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:
kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)
This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).
What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).
On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).
On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.
This means that:
- a status fetch can access the target vnode either side of the exclusive
section of the modification
- the status fetch could start before the modification, yet finish after,
and vice-versa.
- the status fetch and the modification RPCs can complete in either order.
- the status fetch can return either the before or the after DV from the
modification.
- the status fetch might regress the locally cached DV.
Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.
Fix this by the following means:
(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.
(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.
Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.
Fixes: a9e5c87ca7 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHJJAACgkQ+7dXa6fL
C2uv0A//S/sJyToPtj3xbzmRVmSGGWFYNRMaxBD2gYAq7swbDNiX4ZbBCe8A4FBY
zedeMfoNztHIRB2M9vvnhG4HJWXPKq2BaT0xzeteCcmZ65b5zBOrAXue0PQPqE20
xmK1RDls/y5Y2FaF92Ay0VZzXW7+y/M+RRSo+FCFzrIgpJrPprTnlZigrECYauGJ
Qdsv26rQ0flK6tyi6GVuWZIMvpINCt3WwpwQTkAUewz2VewA1tZ1xFe70sP0vF7R
MJNaS6A4uJmvoJJzb8rqdnBGiu76+TxmPaXn0IZKJBECZjBVJyk/duce0jgqbQ7C
PZz5j4C2xrPyu3Y98joj37HPEAHCy0DPRx2Es1mz5cHPzI1TDRClHzPrxyycz9gr
D9WnMiPj9ff9aDaV6XpWKyuHhPxaHpoOD3VGdrhx6bU19Jd3/mLHB3lSt1kJzWdg
QrSAk3KzMWAZigz/+I5xetOpbygKTPLEYgpdmdOSTrtACcm1wjnhIougu0FUIWXK
arPNFOIV9liN0qCQyDOcLx4UEcxXrb2W0AYeHHJDBFxJ7sT2WWUCjPZFW5bh3G+Y
goKv/XJRVWJxFlTXLZLZ5siclzzIlAAmSylh661ji836yRhqTQ3NJTB8QfnrGGsZ
QlD1hjpyqC8uwIGUvoh56KdLRTxj9Gj70gpVe/Lk3Z16mivqDUE=
=fSr0
-----END PGP SIGNATURE-----
Merge tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"Use the new netfs lib.
Begin the process of overhauling the use of the fscache API by AFS and
the introduction of support for features such as Transparent Huge
Pages (THPs).
- Add some support for THPs, including using core VM helper functions
to find details of pages.
- Use the ITER_XARRAY I/O iterator to mediate access to the pagecache
as this handles THPs and doesn't require allocation of large bvec
arrays.
- Delegate address_space read/pre-write I/O methods for AFS to the
netfs helper library. A method is provided to the library that
allows it to issue a read against the server.
This includes a change in use for PG_fscache (it now indicates a
DIO write in progress from the marked page), so a number of waits
need to be deployed for it.
- Split the core AFS writeback function to make it easier to modify
in future patches to handle writing to the cache. [This might
feasibly make more sense moved out into my fscache-iter branch].
I've tested these with "xfstests -g quick" against an AFS volume
(xfstests needs patching to make it work). With this, AFS without a
cache passes all expected xfstests; with a cache, there's an extra
failure, but that's also there before these patches. Fixing that
probably requires a greater overhaul (as can be found on my
fscache-iter branch, but that's for a later time).
Thanks should go to Marc Dionne and Jeff Altman of AuriStor for
exercising the patches in their test farm also"
Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/
* tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Use the netfs_write_begin() helper
afs: Use new netfs lib read helper API
afs: Use the fs operation ops to handle FetchData completion
afs: Prepare for use of THPs
afs: Extract writeback extension into its own function
afs: Wait on PG_fscache before modifying/releasing a page
afs: Use ITER_XARRAY for writing
afs: Set up the iov_iter before calling afs_extract_data()
afs: Log remote unmarshalling errors
afs: Don't truncate iter during data fetch
afs: Move key to afs_read struct
afs: Print the operation debug_id when logging an unexpected data version
afs: Pass page into dirty region helpers to provide THP size
afs: Disable use of the fscache I/O routines
Pull vfs inode type handling updates from Al Viro:
"We should never change the type bits of ->i_mode or the method tables
(->i_op and ->i_fop) of a live inode.
Unfortunately, not all filesystems took care to prevent that"
* 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
spufs: fix bogosity in S_ISGID handling
9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
openpromfs: don't do unlock_new_inode() until the new inode is set up
hostfs_mknod(): don't bother with init_special_inode()
cifs: have cifs_fattr_to_inode() refuse to change type on live inode
cifs: have ->mkdir() handle race with another client sanely
do_cifs_create(): don't set ->i_mode of something we had not created
gfs2: be careful with inode refresh
ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode
orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...
vboxsf: don't allow to change the inode type
afs: Fix updating of i_mode due to 3rd party change
ceph: don't allow type or device number to change on non-I_NEW inodes
ceph: fix up error handling with snapdirs
new helper: inode_wrong_type()
afs_listxattr() lists all the available special afs xattrs (i.e. those in
the "afs.*" space), no matter what type of server we're dealing with. But
OpenAFS servers, for example, cannot deal with some of the extra-capable
attributes that AuriStor (YFS) servers provide. Unfortunately, the
presence of the afs.yfs.* attributes causes errors[1] for anything that
tries to read them if the server is of the wrong type.
Fix the problem by removing afs_listxattr() so that none of the special
xattrs are listed (AFS doesn't support xattrs). It does mean, however,
that getfattr won't list them, though they can still be accessed with
getxattr() and setxattr().
This can be tested with something like:
getfattr -d -m ".*" /afs/example.com/path/to/file
With this change, none of the afs.* attributes should be visible.
Changes:
ver #2:
- Hide all of the afs.* xattrs, not just the ACL ones.
Fixes: ae46578b96 ("afs: Get YFS ACLs and information through xattrs")
Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003502.html [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003567.html # v1
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003573.html # v2
If someone attempts to access YFS-related xattrs (e.g. afs.yfs.acl) on a
file on a non-YFS AFS server (such as OpenAFS), then the kernel will jump
to a NULL function pointer because the afs_fetch_acl_operation descriptor
doesn't point to a function for issuing an operation on a non-YFS
server[1].
Fix this by making afs_wait_for_operation() check that the issue_afs_rpc
method is set before jumping to it and setting -ENOTSUPP if not. This fix
also covers other potential operations that also only exist on YFS servers.
afs_xattr_get/set_yfs() then need to translate -ENOTSUPP to -ENODATA as the
former error is internal to the kernel.
The bug shows up as an oops like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[...]
Call Trace:
afs_wait_for_operation+0x83/0x1b0 [kafs]
afs_xattr_get_yfs+0xe6/0x270 [kafs]
__vfs_getxattr+0x59/0x80
vfs_getxattr+0x11c/0x140
getxattr+0x181/0x250
? __check_object_size+0x13f/0x150
? __fput+0x16d/0x250
__x64_sys_fgetxattr+0x64/0xb0
do_syscall_64+0x49/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb120a9defe
This was triggered with "cp -a" which attempts to copy xattrs, including
afs ones, but is easier to reproduce with getfattr, e.g.:
getfattr -d -m ".*" /afs/openafs.org/
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Gaja Sophie Peters <gaja.peters@math.uni-hamburg.de>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003498.html [1]
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003566.html # v1
Link: http://lists.infradead.org/pipermail/linux-afs/2021-March/003572.html # v2
Fix afs_apply_status() to mask off the irrelevant bits from status->mode
when OR'ing them into i_mode. This can happen when a 3rd party chmod
occurs.
Also fix afs_inode_init_from_status() to mask off the mode bits when
initialising i_mode.
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.
When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.
Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group. This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.
The symptom is that lines looking like the following:
unregister_netdevice: waiting for lo to become free
get emitted at regular intervals after running something like the
referenced syzbot test.
Thanks to Vadim for tracking this down and work out the fix.
Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Fixes: 5271953cad ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.
As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.
Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.
The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.
In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.
Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The number of dirent records used by an AFS directory entry should be
calculated using the assumption that there is a 16-byte name field in the
first block, rather than a 20-byte name field (which is actually the case).
This miscalculation is historic and effectively standard, so we have to use
it.
The calculation we need to use is:
1 + (((strlen(name) + 1) + 15) >> 5)
where we are adding one to the strlen() result to account for the NUL
termination.
Fix this by the following means:
(1) Create an inline function to do the calculation for a given name
length.
(2) Use the function to calculate the number of records used for a dirent
in afs_dir_iterate_block().
Use this to move the over-end check out of the loop since it only
needs to be done once.
Further use this to only go through the loop for the 2nd+ records
composing an entry. The only test there now is for if the record is
allocated - and we already checked the first block at the top of the
outer loop.
(3) Add a max name length check in afs_dir_iterate_block().
(4) Make afs_edit_dir_add() and afs_edit_dir_remove() use the function
from (1) to calculate the number of blocks rather than doing it
incorrectly themselves.
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
AFS has a structured layout in its directory contents (AFS dirs are
downloaded as files and parsed locally by the client for lookup/readdir).
The slots in the directory are defined by union afs_xdr_dirent. This,
however, only directly allows a name of a length that will fit into that
union. To support a longer name, the next 1-8 contiguous entries are
annexed to the first one and the name flows across these.
afs_dir_iterate_block() uses strnlen(), limited to the space to the end of
the page, to find out how long the name is. This worked fine until
6a39e62abb. With that commit, the compiler determines the size of the
array and asserts that the string fits inside that array. This is a
problem for AFS because we *expect* it to overflow one or more arrays.
A similar problem also occurs in afs_dir_scan_block() when a directory file
is being locally edited to avoid the need to redownload it. There strlen()
was being used safely because each page has the last byte set to 0 when the
file is downloaded and validated (in afs_dir_check_page()).
Fix this by changing the afs_xdr_dirent union name field to an
indeterminate-length array and dropping the overflow field.
(Note that whilst looking at this, I realised that the calculation of the
number of slots a dirent used is non-standard and not quite right, but I'll
address that in a separate patch.)
The issue can be triggered by something like:
touch /afs/example.com/thisisaveryveryverylongname
and it generates a report that looks like:
detected buffer overflow in strnlen
------------[ cut here ]------------
kernel BUG at lib/string.c:1149!
...
RIP: 0010:fortify_panic+0xf/0x11
...
Call Trace:
afs_dir_iterate_block+0x12b/0x35b
afs_dir_iterate+0x14e/0x1ce
afs_do_lookup+0x131/0x417
afs_lookup+0x24f/0x344
lookup_open.isra.0+0x1bb/0x27d
open_last_lookups+0x166/0x237
path_openat+0xe0/0x159
do_filp_open+0x48/0xa4
? kmem_cache_alloc+0xf5/0x16e
? __clear_close_on_exec+0x13/0x22
? _raw_spin_unlock+0xa/0xb
do_sys_openat2+0x72/0xde
do_sys_open+0x3b/0x58
do_syscall_64+0x2d/0x3a
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 6a39e62abb ("lib: string.h: detect intra-object overflow in fortified string functions")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: Daniel Axtens <dja@axtens.net>
There's a memory leak in afs_parse_source() whereby multiple source=
parameters overwrite fc->source in the fs_context struct without freeing
the previously recorded source.
Fix this by only permitting a single source parameter and rejecting with
an error all subsequent ones.
This was caught by syzbot with the kernel memory leak detector, showing
something like the following trace:
unreferenced object 0xffff888114375440 (size 32):
comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
backtrace:
slab_post_alloc_hook+0x42/0x79
__kmalloc_track_caller+0x125/0x16a
kmemdup_nul+0x24/0x3c
vfs_parse_fs_string+0x5a/0xa1
generic_parse_monolithic+0x9d/0xc5
do_new_mount+0x10d/0x15a
do_mount+0x5f/0x8e
__do_sys_mount+0xff/0x127
do_syscall_64+0x2d/0x3a
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 13fcc68370 ("afs: Add fs_context support")
Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing a lookup in a directory, the afs filesystem uses a bulk
status fetch to speculatively retrieve the statuses of up to 48 other
vnodes found in the same directory and it will then either update extant
inodes or create new ones - effectively doing 'lookup ahead'.
To avoid the possibility of deadlocking itself, however, the filesystem
doesn't lock all of those inodes; rather just the directory inode is
locked (by the VFS).
When the operation completes, afs_inode_init_from_status() or
afs_apply_status() is called, depending on whether the inode already
exists, to commit the new status.
A case exists, however, where the speculative status fetch operation may
straddle a modification operation on one of those vnodes. What can then
happen is that the speculative bulk status RPC retrieves the old status,
and whilst that is happening, the modification happens - which returns
an updated status, then the modification status is committed, then we
attempt to commit the speculative status.
This results in something like the following being seen in dmesg:
kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
say that the vnode had data version 8 when we'd already recorded version
9 due to a local modification. This was causing the cache to be
invalidated for that vnode when it shouldn't have been. If it happens
on a data file, this might lead to local changes being lost.
Fix this by ignoring speculative status updates if the data version
doesn't match the expected value.
Note that it is possible to get a DV regression if a volume gets
restored from a backup - but we should get a callback break in such a
case that should trigger a recheck anyway. It might be worth checking
the volume creation time in the volsync info and, if a change is
observed in that (as would happen on a restore), invalidate all caches
associated with the volume.
Fixes: 5cf9dd55a0 ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When afs_write_end() is called with copied == 0, it tries to set the
dirty region, but there's no way to actually encode a 0-length region in
the encoding in page->private.
"0,0", for example, indicates a 1-byte region at offset 0. The maths
miscalculates this and sets it incorrectly.
Fix it to just do nothing but unlock and put the page in this case. We
don't actually need to mark the page dirty as nothing presumably
changed.
Fixes: 65dd2d6072 ("afs: Alter dirty range encoding in page->private")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
function to destroy the ACL content buffer. It's an afs_acl struct, not
a yfs_acl struct - and the free function for latter may pass invalid
pointers to kfree().
Fix this by using the afs_acl_put() function. The yfs_acl_put()
function is then no longer used and can be removed.
general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
...
RIP: 0010:compound_head+0x0/0x11
...
Call Trace:
virt_to_cache+0x8/0x51
kfree+0x5d/0x79
yfs_free_opaque_acl+0x16/0x29
afs_put_operation+0x60/0x114
__vfs_setxattr+0x67/0x72
__vfs_setxattr_noperm+0x66/0xe9
vfs_setxattr+0x67/0xce
setxattr+0x14e/0x184
__do_sys_fsetxattr+0x66/0x8f
do_syscall_64+0x2d/0x3a
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using the afs.yfs.acl xattr to change an AuriStor ACL, a warning
can be generated when the request is marshalled because the buffer
pointer isn't increased after adding the last element, thereby
triggering the check at the end if the ACL wasn't empty. This just
causes something like the following warning, but doesn't stop the call
from happening successfully:
kAFS: YFS.StoreOpaqueACL2: Request buffer underflow (36<108)
Fix this simply by increasing the count prior to the check.
Fixes: f5e4546347 ("afs: Implement YFS ACL setting")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty region bounds stored in page->private on an afs page are 15 bits
on a 32-bit box and can, at most, represent a range of up to 32K within a
32K page with a resolution of 1 byte. This is a problem for powerpc32 with
64K pages enabled.
Further, transparent huge pages may get up to 2M, which will be a problem
for the afs filesystem on all 32-bit arches in the future.
Fix this by decreasing the resolution. For the moment, a 64K page will
have a resolution determined from PAGE_SIZE. In the future, the page will
need to be passed in to the helper functions so that the page size can be
assessed and the resolution determined dynamically.
Note that this might not be the ideal way to handle this, since it may
allow some leakage of undirtied zero bytes to the server's copy in the case
of a 3rd-party conflict. Fixing that would require a separately allocated
record and is a more complicated fix.
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fix afs_invalidatepage() to adjust the dirty region recorded in
page->private when truncating a page. If the dirty region is entirely
removed, then the private data is cleared and the page dirty state is
cleared.
Without this, if the page is truncated and then expanded again by truncate,
zeros from the expanded, but no-longer dirty region may get written back to
the server if the page gets laundered due to a conflicting 3rd-party write.
It mustn't, however, shorten the dirty region of the page if that page is
still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
stored in page->private to record this.
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
Currently, page->private on an afs page is used to store the range of
dirtied data within the page, where the range includes the lower bound, but
excludes the upper bound (e.g. 0-1 is a range covering a single byte).
This, however, requires a superfluous bit for the last-byte bound so that
on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
being that having both numbers the same would indicate an empty range.
This is unnecessary as the PG_private bit is clear if it's an empty range
(as is PG_dirty).
Alter the way the dirty range is encoded in page->private such that the
upper bound is reduced by 1 (e.g. 0-0 is then specified the same single
byte range mentioned above).
Applying this to both bounds frees up two bits, one of which can be used in
a future commit.
This allows the afs filesystem to be compiled on ppc32 with 64K pages;
without this, the following warnings are seen:
../fs/afs/internal.h: In function 'afs_page_dirty_to':
../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
881 | return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
| ^~
../fs/afs/internal.h: In function 'afs_page_dirty':
../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
886 | return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
| ^~
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The afs filesystem uses page->private to store the dirty range within a
page such that in the event of a conflicting 3rd-party write to the server,
we write back just the bits that got changed locally.
However, there are a couple of problems with this:
(1) I need a bit to note if the page might be mapped so that partial
invalidation doesn't shrink the range.
(2) There aren't necessarily sufficient bits to store the entire range of
data altered (say it's a 32-bit system with 64KiB pages or transparent
huge pages are in use).
So wrap the accesses in inline functions so that future commits can change
how this works.
Also move them out of the tracing header into the in-directory header.
There's not really any need for them to be in the tracing header.
Signed-off-by: David Howells <dhowells@redhat.com>
In afs, page->private is set to indicate the dirty region of a page. This
is done in afs_write_begin(), but that can't take account of whether the
copy into the page actually worked.
Fix this by moving the change of page->private into afs_write_end().
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the leak of the target page in afs_write_begin() when it fails.
Fixes: 15b4650e55 ("afs: convert to new aops")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Nick Piggin <npiggin@gmail.com>
Fix afs to take a ref on a page when it sets PG_private on it and to drop
the ref when removing the flag.
Note that in afs_write_begin(), a lot of the time, PG_private is already
set on a page to which we're going to add some data. In such a case, we
leave the bit set and mustn't increment the page count.
As suggested by Matthew Wilcox, use attach/detach_page_private() where
possible.
Fixes: 31143d5d51 ("AFS: implement basic file write support")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fix afs_launder_page() to not clear PG_writeback on the page it is
laundering as the flag isn't set in this case.
Fixes: 4343d00872 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The "op" pointer is freed earlier when we call afs_put_operation().
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Colin Ian King <colin.king@canonical.com>
The patch dca54a7bbb8c: "afs: Add tracing for cell refcount and active user
count" from Oct 13, 2020, leads to the following Smatch complaint:
fs/afs/cell.c:596 afs_unuse_cell()
warn: variable dereferenced before check 'cell' (see line 592)
Fix this by moving the retrieval of the cell debug ID to after the check of
the validity of the cell pointer.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: dca54a7bbb ("afs: Add tracing for cell refcount and active user count")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dan Carpenter <dan.carpenter@oracle.com>
The prevention of splice-write without explicit ops made the
copy_file_write() syscall to an afs file (as done by the generic/112
xfstest) fail with EINVAL.
Fix by using iter_file_splice_write() for afs.
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl+JqvsACgkQ+7dXa6fL
C2sQeRAAiW+rGQa46ybOv8qBLgVEiH6OzNUk62fesbhl5rKRgapS2r2SJbau29gv
KqBld29lp4o84/THJ6hZ6Al1iwZnO2FW+h53Q47M5TH9AbwZhf/zooRSoGb5AmKJ
/FR4+K/zTLk7VCSptTtXlKG81bOKQF+tmi6sLdxx70h2T+Ythm3zdVq8PCZZhhGG
hw1IfuNDNbUDGus5WgZfIdVkdFCs5WW5cEhrPgqKR0mXYQklnkwgtov/+RAh/Ewf
bZ1JFqap15KJ2AcKgw79NZ01MBJ4KyZckzKwgaTVlIEtCMQMBDhJpbKzdtP767rh
xkxdzmDmXOop9RNQ8WrIIt6EjpB0We2qxQVGLsPHdEixmlv5n2BjL/wdWa3u/4QZ
ymjv7B8jcFSpXQzVnrmeNYsgHo1oJ7G8q8Xh+CJm9BNBtNUiz1raQfxWpS5TeWT4
dQsr/Z/PMCBCBOb6F08pjF8od67DUCx+x1nkCG4qFCeXiC4ouTNkfMZnxWH9mVKP
v5Hca2E0ilR3PI5+tn8o/ZPul5NDTesBnAvAAEBQDT79Ff8kAgEJ3M3ipy38VoKV
xFiNpNHwDPp15vzneJ7f7Ir27VauRlhCLEuSUtyYH4pD458K3UQAAQRFZB7bkzMX
TQTbmEPUbtrs/mOEMKQN6ew8dwawRpMy01j0KnDl0KBFDtfTLU8=
=0anM
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"A collection of fixes to fix afs_cell struct refcounting, thereby
fixing a slew of related syzbot bugs:
- Fix the cell tree in the netns to use an rwsem rather than RCU.
There seem to be some problems deriving from the use of RCU and a
seqlock to walk the rbtree, but it's not entirely clear what since
there are several different failures being seen.
Changing things to use an rwsem instead makes it more robust. The
extra performance derived from using RCU isn't necessary in this
case since the only time we're looking up a cell is during mount or
when cells are being manually added.
- Fix the refcounting by splitting the usage counter into a memory
refcount and an active users counter. The usage counter was doing
double duty, keeping track of whether a cell is still in use and
keeping track of when it needs to be destroyed - but this makes the
clean up tricky. Separating these out simplifies the logic.
- Fix purging a cell that has an alias. A cell alias pins the cell
it's an alias of, but the alias is always later in the list. Trying
to purge in a single pass causes rmmod to hang in such a case.
- Fix cell removal. If a cell's manager is requeued whilst it's
removing itself, the manager will run again and re-remove itself,
causing problems in various places. Follow Hillf Danton's
suggestion to insert a more terminal state that causes the manager
to do nothing post-removal.
In additional to the above, two other changes:
- Add a tracepoint for the cell refcount and active users count. This
helped with debugging the above and may be useful again in future.
- Downgrade an assertion to a print when a still-active server is
seen during purging. This was happening as a consequence of
incomplete cell removal before the servers were cleaned up"
* tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Don't assert on unpurgeable server records
afs: Add tracing for cell refcount and active user count
afs: Fix cell removal
afs: Fix cell purging with aliases
afs: Fix cell refcounting by splitting the usage counter
afs: Fix rapid cell addition/removal by not using RCU on cells tree
Don't give an assertion failure on unpurgeable afs_server records - which
kills the thread - but rather emit a trace line when we are purging a
record (which only happens during network namespace removal or rmmod) and
print a notice of the problem.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint to log the cell refcount and active user count and pass in
a reason code through various functions that manipulate these counters.
Additionally, a helper function, afs_see_cell(), is provided to log
interesting places that deal with a cell without actually doing any
accounting directly.
Signed-off-by: David Howells <dhowells@redhat.com>
When the afs module is removed, one of the things that has to be done is to
purge the cell database. afs_cell_purge() cancels the management timer and
then starts the cell manager work item to do the purging. This does a
single run through and then assumes that all cells are now purged - but
this is no longer the case.
With the introduction of alias detection, a later cell in the database can
now be holding an active count on an earlier cell (cell->alias_of). The
purge scan passes by the earlier cell first, but this can't be got rid of
until it has discarded the alias. Ordinarily, afs_unuse_cell() would
handle this by setting the management timer to trigger another pass - but
afs_set_cell_timer() doesn't do anything if the namespace is being removed
(net->live == false). rmmod then hangs in the wait on cells_outstanding in
afs_cell_purge().
Fix this by making afs_set_cell_timer() directly queue the cell manager if
net->live is false. This causes additional management passes.
Queueing the cell manager increments cells_outstanding to make sure the
wait won't complete until all cells are destroyed.
Fixes: 8a070a9648 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
Management of the lifetime of afs_cell struct has some problems due to the
usage counter being used to determine whether objects of that type are in
use in addition to whether anyone might be interested in the structure.
This is made trickier by cell objects being cached for a period of time in
case they're quickly reused as they hold the result of a setup process that
may be slow (DNS lookups, AFS RPC ops).
Problems include the cached root volume from alias resolution pinning its
parent cell record, rmmod occasionally hanging and occasionally producing
assertion failures.
Fix this by splitting the count of active users from the struct reference
count. Things then work as follows:
(1) The cell cache keeps +1 on the cell's activity count and this has to
be dropped before the cell can be removed. afs_manage_cell() tries to
exchange the 1 to a 0 with the cells_lock write-locked, and if
successful, the record is removed from the net->cells.
(2) One struct ref is 'owned' by the activity count. That is put when the
active count is reduced to 0 (final_destruction label).
(3) A ref can be held on a cell whilst it is queued for management on a
work queue without confusing the active count. afs_queue_cell() is
added to wrap this.
(4) The queue's ref is dropped at the end of the management. This is
split out into a separate function, afs_manage_cell_work().
(5) The root volume record is put after a cell is removed (at the
final_destruction label) rather then in the RCU destruction routine.
(6) Volumes hold struct refs, but aren't active users.
(7) Both counts are displayed in /proc/net/afs/cells.
There are some management function changes:
(*) afs_put_cell() now just decrements the refcount and triggers the RCU
destruction if it becomes 0. It no longer sets a timer to have the
manager do this.
(*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
the active count. afs_unuse_cell() sets the management timer.
(*) afs_queue_cell() is added to queue a cell with approprate refs.
There are also some other fixes:
(*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.
(*) Make sure that candidate cells in lookups are properly destroyed
rather than being simply kfree'd. This ensures the bits it points to
are destroyed also.
(*) afs_dec_cells_outstanding() is now called in cell destruction rather
than at "final_destruction". This ensures that cell->net is still
valid to the end of the destructor.
(*) As a consequence of the previous two changes, move the increment of
net->cells_outstanding that was at the point of insertion into the
tree to the allocation routine to correctly balance things.
Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
There are a number of problems that are being seen by the rapidly mounting
and unmounting an afs dynamic root with an explicit cell and volume
specified (which should probably be rejected, but that's a separate issue):
What the tests are doing is to look up/create a cell record for the name
given and then tear it down again without actually using it to try to talk
to a server. This is repeated endlessly, very fast, and the new cell
collides with the old one if it's not quick enough to reuse it.
It appears (as suggested by Hillf Danton) that the search through the RB
tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
that it's not blocking the write_seqlock(), despite taking two passes at
it. He suggested that the code should take a ref on the cell it's
attempting to look at - but this shouldn't be necessary until we've
compared the cell names. It's possible that I'm missing a barrier
somewhere.
However, using an RCU search for this is overkill, really - we only need to
access the cell name in a few places, and they're places where we're may
end up sleeping anyway.
Fix this by switching to an R/W semaphore instead.
Additionally, draw the down_read() call inside the function (renamed to
afs_find_cell()) since all the callers were taking the RCU read lock (or
should've been[*]).
[*] afs_probe_cell_name() should have been, but that doesn't appear to be
involved in the bug reports.
The symptoms of this look like:
general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
...
RIP: 0010:strncasecmp lib/string.c:52 [inline]
RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
afs_parse_source fs/afs/super.c:290 [inline]
...
Fixes: 989782dcdc ("afs: Overhaul cell database management")
Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
cc: syzkaller-bugs@googlegroups.com
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
ZwfpDh2+Tg==
=LzyE
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)
- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)
* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
The afs filesystem has a lock[*] that it uses to serialise I/O operations
going to the server (vnode->io_lock), as the server will only perform one
modification operation at a time on any given file or directory. This
prevents the the filesystem from filling up all the call slots to a server
with calls that aren't going to be executed in parallel anyway, thereby
allowing operations on other files to obtain slots.
[*] Note that is probably redundant for directories at least since
i_rwsem is used to serialise directory modifications and
lookup/reading vs modification. The server does allow parallel
non-modification ops, however.
When a file truncation op completes, we truncate the in-memory copy of the
file to match - but we do it whilst still holding the io_lock, the idea
being to prevent races with other operations.
However, if writeback starts in a worker thread simultaneously with
truncation (whilst notify_change() is called with i_rwsem locked, writeback
pays it no heed), it may manage to set PG_writeback bits on the pages that
will get truncated before afs_setattr_success() manages to call
truncate_pagecache(). Truncate will then wait for those pages - whilst
still inside io_lock:
# cat /proc/8837/stack
[<0>] wait_on_page_bit_common+0x184/0x1e7
[<0>] truncate_inode_pages_range+0x37f/0x3eb
[<0>] truncate_pagecache+0x3c/0x53
[<0>] afs_setattr_success+0x4d/0x6e
[<0>] afs_wait_for_operation+0xd8/0x169
[<0>] afs_do_sync_operation+0x16/0x1f
[<0>] afs_setattr+0x1fb/0x25d
[<0>] notify_change+0x2cf/0x3c4
[<0>] do_truncate+0x7f/0xb2
[<0>] do_sys_ftruncate+0xd1/0x104
[<0>] do_syscall_64+0x2d/0x3a
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The writeback operation, however, stalls indefinitely because it needs to
get the io_lock to proceed:
# cat /proc/5940/stack
[<0>] afs_get_io_locks+0x58/0x1ae
[<0>] afs_begin_vnode_operation+0xc7/0xd1
[<0>] afs_store_data+0x1b2/0x2a3
[<0>] afs_write_back_from_locked_page+0x418/0x57c
[<0>] afs_writepages_region+0x196/0x224
[<0>] afs_writepages+0x74/0x156
[<0>] do_writepages+0x2d/0x56
[<0>] __writeback_single_inode+0x84/0x207
[<0>] writeback_sb_inodes+0x238/0x3cf
[<0>] __writeback_inodes_wb+0x68/0x9f
[<0>] wb_writeback+0x145/0x26c
[<0>] wb_do_writeback+0x16a/0x194
[<0>] wb_workfn+0x74/0x177
[<0>] process_one_work+0x174/0x264
[<0>] worker_thread+0x117/0x1b9
[<0>] kthread+0xec/0xf1
[<0>] ret_from_fork+0x1f/0x30
and thus deadlock has occurred.
Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
that the caller is holding i_rwsem doesn't preclude more pages being
dirtied through an mmap'd region.
Fix this by:
(1) Use the vnode validate_lock to mediate access between afs_setattr()
and afs_writepages():
(a) Exclusively lock validate_lock in afs_setattr() around the whole
RPC operation.
(b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to
shared-lock validate_lock and returning immediately if we couldn't
get it.
(c) If WB_SYNC_ALL is set, wait for the lock.
The validate_lock is also used to validate a file and to zap its cache
if the file was altered by a third party, so it's probably a good fit
for this.
(2) Move the truncation outside of the io_lock in setattr, using the same
hook as is used for local directory editing.
This requires the old i_size to be retained in the operation record as
we commit the revised status to the inode members inside the io_lock
still, but we still need to know if we reduced the file size.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set up a readahead size by default, as very few users have a good
reason to change it. This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull networking fixes from David Miller:
1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
Kivilinna.
2) Fix loss of RTT samples in rxrpc, from David Howells.
3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu.
4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.
5) We disable BH for too lokng in sctp_get_port_local(), add a
cond_resched() here as well, from Xin Long.
6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.
7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
Yonghong Song.
8) Missing of_node_put() in mt7530 DSA driver, from Sumera
Priyadarsini.
9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.
10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.
11) Memory leak in rxkad_verify_response, from Dinghao Liu.
12) In tipc, don't use smp_processor_id() in preemptible context. From
Tuong Lien.
13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.
14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.
15) Fix ABI mismatch between driver and firmware in nfp, from Louis
Peens.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
net/smc: fix sock refcounting in case of termination
net/smc: reset sndbuf_desc if freed
net/smc: set rx_off for SMCR explicitly
net/smc: fix toleration of fake add_link messages
tg3: Fix soft lockup when tg3_reset_task() fails.
doc: net: dsa: Fix typo in config code sample
net: dp83867: Fix WoL SecureOn password
nfp: flower: fix ABI mismatch between driver and firmware
tipc: fix shutdown() of connectionless socket
ipv6: Fix sysctl max for fib_multipath_hash_policy
drivers/net/wan/hdlc: Change the default of hard_header_len to 0
net: gemini: Fix another missing clk_disable_unprepare() in probe
net: bcmgenet: fix mask check in bcmgenet_validate_flow()
amd-xgbe: Add support for new port mode
net: usb: dm9601: Add USB ID of Keenetic Plus DSL
vhost: fix typo in error message
net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
pktgen: fix error message with wrong function name
net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
cxgb4: fix thermal zone device registration
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9HwPIACgkQ+7dXa6fL
C2vHdw//S5s+nPNuJUpz3aeYDC1Yl1YP+r2+XBfcCNem504GKfsvTBbLo7FYP6Oc
TCLJouPxoNSWAtlrJ86YJu8EcFdYNI4w6mCHncIrLXiiIZhsAeUMAw1GiASL5VUr
QhLjJJU4zUBg3/RWLRC1pUXBXAa3sYx5r1p8j3KdBAkAmAzXdFWSRFeffVeXIk46
FZ52YoQkplJYqDL+1oQCjJLetVJGGvc68AIOjJOLh6CAjrZbx1aiY79XA+flsoE9
B+KVjhEn0f9dVybgtF4rEqS88y1FJtSQgul6FOsys+Rx2I0G7ei8PB5TQ2wNCQhy
37gGfg1AV6apcqXKcHJdovVnApoQzzdCJESCgbMvsczqM88dP7pLWrPz4wpxs655
3t6qUyXI6yMOJhUWPOoFi30+4NM+MmsqrbYLZt2f/aXRhKnyaeaLd+VAIlkgz3Lj
sZuuHQsTH0xiR2uCsINWEc7d7UV6WjeVUJ77LzYiiRzEC2pX80tGu5EKHN8W9oBk
xRAuExXEGtyOR0p3/S3StkT490Tt4bIxUwnKYAaQZEydMCnTlWVbtIXwzmlCbCSB
p+P/7twR7LiQlHCTJU94jH3Jfpm3jpFpgjQZoZcx6ZLvtSH4+QP11nMY9+sRxEz1
hpB12AY7Wp2N7P+GhBPuXUCQpzW751aNZz4X9Etu6kRgwjuyaJM=
=FWHJ
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200820' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc, afs: Fix probing issues
Here are some fixes for rxrpc and afs to fix issues in the RTT measuring in
rxrpc and thence the Volume Location server probing in afs:
(1) Move the serial number of a received ACK into a local variable to
simplify the next patch.
(2) Fix the loss of RTT samples due to extra interposed ACKs causing
baseline information to be discarded too early. This is a particular
problem for afs when it sends a single very short call to probe a
server it hasn't talked to recently.
(3) Fix rxrpc_kernel_get_srtt() to indicate whether it actually has seen
any valid samples or not.
(4) Remove a field that's set/woken, but never read/waited on.
(5) Expose the RTT and other probe information through procfs to make
debugging of this stuff easier.
(6) Fix VL rotation in afs to only use summary information from VL probing
and not the probe running state (which gets clobbered when next a
probe is issued).
(7) Fix VL rotation to actually return the error aggregated from the probe
errors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The fall through annotation comes after a return statement so it's not
reachable.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
If an error occurs during the construction of an afs superblock, it's
possible that an error occurs after a superblock is created, but before
we've created the root dentry. If the superblock has a dynamic root
(ie. what's normally mounted on /afs), the afs_kill_super() will call
afs_dynroot_depopulate() to unpin any created dentries - but this will
oops if the root hasn't been created yet.
Fix this by skipping that bit of code if there is no root dentry.
This leads to an oops looking like:
general protection fault, ...
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
...
RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
...
Call Trace:
afs_kill_super+0x13b/0x180 fs/afs/super.c:535
deactivate_locked_super+0x94/0x160 fs/super.c:335
afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
vfs_get_tree+0x89/0x2f0 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x1387/0x2070 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount fs/namespace.c:3390 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
which is oopsing on this line:
inode_lock(root->d_inode);
presumably because sb->s_root was NULL.
Fixes: 0da0b7fd73 ("afs: Display manually added cells in dynamic root mount")
Reported-by: syzbot+c1eff8205244ae7e11a6@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The afs_put_operation() function needs to put the reference to the key
that's authenticating the operation.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error handling in the VL server rotation in the case of there being no
contactable servers is not correct. In such a case, the records of all the
servers in the list are scanned and the errors and abort codes are mapped
and prioritised and one error is chosen. This is then forgotten and the
default error is used (EDESTADDRREQ).
Fix this by using the calculated error.
Also we need to note whether a server responded on one of its endpoints so
that we can priorise an error from an abort message over local and network
errors.
Fixes: 4584ae96ae ("afs: Fix missing net error handling")
Signed-off-by: David Howells <dhowells@redhat.com>
Don't use the running state for VL server probes to make decisions about
which server to use as the state is cleared at the start of a probe and
intermediate values might also be misleading.
Instead, add a separate 'latest known' rtt in the afs_vlserver struct and a
flag to indicate if the server is known to be responding and update these
as and when we know what to change them to.
Fixes: 3bf0fb6f33 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Convert various bitfields in afs_vlserver::probe to a mask and then expose
this and some other bits of information through /proc/net/afs/<cell>/vlservers
to make it easier to debug VL server communication issues.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT. If we haven't had any valid samples yet, the SRTT isn't
useful.
Fixes: c410bf0193 ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: David Howells <dhowells@redhat.com>