OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jeff Layton	d801327d95	ceph: convert ceph_write_begin to netfs_write_begin Convert ceph_write_begin to use the netfs_write_begin helper. Most of the ops we need for it are already in place from the readpage conversion but we do add a new check_write_begin op since ceph needs to be able to vet whether there is an incompatible writeback already in flight before reading in the page. With this, we can also remove the old ceph_do_readpage helper. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:52:22 +02:00
Jeff Layton	f0702876e1	ceph: convert ceph_readpage to netfs_readpage Have the ceph KConfig select NETFS_SUPPORT. Add a new netfs ops structure and the operations for it. Convert ceph_readpage to use the new netfs_readpage helper. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:52:22 +02:00
Jeff Layton	10a7052c78	ceph: fix fscache invalidation Ensure that we invalidate the fscache whenever we invalidate the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:52:22 +02:00
Jeff Layton	7c46b31809	ceph: rework PageFsCache handling With the new fscache API, the PageFsCache bit now indicates that the page is being written to the cache and shouldn't be modified or released until it's finished. Change releasepage and invalidatepage to wait on that bit before returning. Also define FSCACHE_USE_NEW_IO_API so that we opt into the new fscache API. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:52:21 +02:00
Jeff Layton	e7df4524cd	ceph: rip out old fscache readpage handling With the new netfs read helper functions, we won't need a lot of this infrastructure as it handles the pagecache pages itself. Rip out the read handling for now, and much of the old infrastructure that deals in individual pages. The cookie handling is mostly unchanged, however. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:52:21 +02:00
Ilya Dryomov	8b01888992	Merge remote-tracking branch 'dhowells/netfs-lib' Pick up David Howells' netfs helper library and the new fscache API. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-04-27 23:10:58 +02:00
Linus Torvalds	f1c921fb70	selinux/stable-5.13 PR 20210426 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmCHM2sUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNfCg/9GmoCyCh+ZRj5RGQ6M+yJas1+yyJQ uEfTNde54yfATUTaaWYnZG59yqzM3I2uaV11U7tqg8ajiFPxJKqbs5R9jl3lnSjH 0Dg22nXPSCOTKcU0x/DeLoKRr+M9jO1K/nQ8NEZvYX4nC/OgtCvJqb/oEQZIKAk5 2a7OEmNNQyFGd274p9dELaDHxN9UIaJ2PzQFXtq7ROHgBXQO4ONb2ajOf6mDSFQb vP/CDHwaH+pcE28w44oRy0/YBkO1SrdqoFQchg5yFagM5tQRLGkXK4OFSs5KHi5Q YMtmaOzMPIv1e5eaC1HuuMJYA4pPb30T9hFHP7tmBVZfmZaFaDeUs+BhMm98WTiS o0iTP7tfs36/poOR1Q0/sB06uvF9hUAAX1ZuE95YySifbXU9hsUc9b0uQSwCdg9P /J9rcdHLTpWqjw9n02mezWmAvo5U8ZvbDs+0xPIwI+3RTUP5t6mp+Hd5Tc7bPTq1 0rpWXx+FQoSytFap5qiUSiwBp+HF6HQnNIXB0Muf6wctChoTjvo7TwoxH//z4kEm +SddhOCNkB7VC/X7hOxhl0F/rdHuXvb1AFIWjpTLJH2CR1PvMtF+sGey+uPT6hKZ /gvhmQGjFdph99eGlfVbCNvx1pM61O25IscaYD1T2wGImw+z7dX4WkG3WoOdDSkR bRjrBkcHh0gLhWk= =HTEy -----END PGP SIGNATURE----- Merge tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add support for measuring the SELinux state and policy capabilities using IMA. - A handful of SELinux/NFS patches to compare the SELinux state of one mount with a set of mount options. Olga goes into more detail in the patch descriptions, but this is important as it allows more flexibility when using NFS and SELinux context mounts. - Properly differentiate between the subjective and objective LSM credentials; including support for the SELinux and Smack. My clumsy attempt at a proper fix for AppArmor didn't quite pass muster so John is working on a proper AppArmor patch, in the meantime this set of patches shouldn't change the behavior of AppArmor in any way. This change explains the bulk of the diffstat beyond security/. - Fix a problem where we were not properly terminating the permission list for two SELinux object classes. * tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: add proper NULL termination to the secclass_map permissions smack: differentiate between subjective and objective task credentials selinux: clarify task subjective and objective credentials lsm: separate security_task_getsecid() into subjective and objective variants nfs: account for selinux security context when deciding to share superblock nfs: remove unneeded null check in nfs_fill_super() lsm,selinux: add new hook to compare new mount to an existing mount selinux: fix misspellings using codespell tool selinux: fix misspellings using codespell tool selinux: measure state and policy capabilities selinux: Allow context mounts for unpriviliged overlayfs	2021-04-27 13:42:11 -07:00
Linus Torvalds	fafe1e39ed	AFS: Use the new netfs lib -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHJJAACgkQ+7dXa6fL C2uv0A//S/sJyToPtj3xbzmRVmSGGWFYNRMaxBD2gYAq7swbDNiX4ZbBCe8A4FBY zedeMfoNztHIRB2M9vvnhG4HJWXPKq2BaT0xzeteCcmZ65b5zBOrAXue0PQPqE20 xmK1RDls/y5Y2FaF92Ay0VZzXW7+y/M+RRSo+FCFzrIgpJrPprTnlZigrECYauGJ Qdsv26rQ0flK6tyi6GVuWZIMvpINCt3WwpwQTkAUewz2VewA1tZ1xFe70sP0vF7R MJNaS6A4uJmvoJJzb8rqdnBGiu76+TxmPaXn0IZKJBECZjBVJyk/duce0jgqbQ7C PZz5j4C2xrPyu3Y98joj37HPEAHCy0DPRx2Es1mz5cHPzI1TDRClHzPrxyycz9gr D9WnMiPj9ff9aDaV6XpWKyuHhPxaHpoOD3VGdrhx6bU19Jd3/mLHB3lSt1kJzWdg QrSAk3KzMWAZigz/+I5xetOpbygKTPLEYgpdmdOSTrtACcm1wjnhIougu0FUIWXK arPNFOIV9liN0qCQyDOcLx4UEcxXrb2W0AYeHHJDBFxJ7sT2WWUCjPZFW5bh3G+Y goKv/XJRVWJxFlTXLZLZ5siclzzIlAAmSylh661ji836yRhqTQ3NJTB8QfnrGGsZ QlD1hjpyqC8uwIGUvoh56KdLRTxj9Gj70gpVe/Lk3Z16mivqDUE= =fSr0 -----END PGP SIGNATURE----- Merge tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "Use the new netfs lib. Begin the process of overhauling the use of the fscache API by AFS and the introduction of support for features such as Transparent Huge Pages (THPs). - Add some support for THPs, including using core VM helper functions to find details of pages. - Use the ITER_XARRAY I/O iterator to mediate access to the pagecache as this handles THPs and doesn't require allocation of large bvec arrays. - Delegate address_space read/pre-write I/O methods for AFS to the netfs helper library. A method is provided to the library that allows it to issue a read against the server. This includes a change in use for PG_fscache (it now indicates a DIO write in progress from the marked page), so a number of waits need to be deployed for it. - Split the core AFS writeback function to make it easier to modify in future patches to handle writing to the cache. [This might feasibly make more sense moved out into my fscache-iter branch]. I've tested these with "xfstests -g quick" against an AFS volume (xfstests needs patching to make it work). With this, AFS without a cache passes all expected xfstests; with a cache, there's an extra failure, but that's also there before these patches. Fixing that probably requires a greater overhaul (as can be found on my fscache-iter branch, but that's for a later time). Thanks should go to Marc Dionne and Jeff Altman of AuriStor for exercising the patches in their test farm also" Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/ * tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Use the netfs_write_begin() helper afs: Use new netfs lib read helper API afs: Use the fs operation ops to handle FetchData completion afs: Prepare for use of THPs afs: Extract writeback extension into its own function afs: Wait on PG_fscache before modifying/releasing a page afs: Use ITER_XARRAY for writing afs: Set up the iov_iter before calling afs_extract_data() afs: Log remote unmarshalling errors afs: Don't truncate iter during data fetch afs: Move key to afs_read struct afs: Print the operation debug_id when logging an unexpected data version afs: Pass page into dirty region helpers to provide THP size afs: Disable use of the fscache I/O routines	2021-04-27 13:27:39 -07:00
Linus Torvalds	820c4bae40	Network filesystem helper library -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHPZwACgkQ+7dXa6fL C2uJxw/9FVNssHxtA8iFDvZskE4YHiL6vMgOgKOeVmBfUvxqJcxWQXcF8ycbon5y jGcDRV1DWTv395ckALHqmD6SlH/5q+OBt4cCOXCebOlzbC63JmjJ6xOjHntZKw3i 9c3GITNca5AsPXHXHGIcoRY4/4FntpLoVpyfYJ4ZZJCY7a7QUbgnEIIy9/Ps8Clw BahhiKChl2JCgV3KZBk/ypkf0IBduxKgT+IUxA9o7H5UsLzvUgnfd5uMIALLPMI1 NXzUHBJoUtnWcB52nWPufJx9YwkMfSx70mutT0T74CFxbJakwRgAl2tWr5g989qM /fQrsOhMlU3NaXYaRPelbxkuzvy3hU1xSe3GLiZcxmh4Cb/YAX0TrHRecO62NWff pu/UWQS8Du5Gy8DrHScuo8baI1KFfyiV2lWQPfBO8kPaEB2ERw+PN6fWSh993Cn9 4UHaR3Oyn4qyVXeirNZg+frado+BEZAbNMZwn0lyi6jnLeyir6qABOdpQk34SB35 D4jfdPOBxeh3OVFkc+EBJ98i3/nal2+yXrNOqkP4OwmF0HqGt0YKKSaLNigXaDdO 3CKmQlBqBZsUdRYHJyJsofrifkKjP78zx2WyUJPms8MGX9z+9kYR3f1erifLesCT Kb2TrAFx4ZgqS5tFh6UHnX4x0qy2RckgNrKTMpv38K8lNqplvLo= =tZgy -----END PGP SIGNATURE----- Merge tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull network filesystem helper library updates from David Howells: "Here's a set of patches for 5.13 to begin the process of overhauling the local caching API for network filesystems. This set consists of two parts: (1) Add a helper library to handle the new VM readahead interface. This is intended to be used unconditionally by the filesystem (whether or not caching is enabled) and provides a common framework for doing caching, transparent huge pages and, in the future, possibly fscrypt and read bandwidth maximisation. It also allows the netfs and the cache to align, expand and slice up a read request from the VM in various ways; the netfs need only provide a function to read a stretch of data to the pagecache and the helper takes care of the rest. (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb facility to do async DIO to transfer data to/from the netfs's pages, rather than using readpage with wait queue snooping on one side and vfs_write() on the other. It also uses less memory, since it doesn't do buffered I/O on the backing file. Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available to be read from the cache. Whilst this is an improvement from the bmap interface, it still has a problem with regard to a modern extent-based filesystem inserting or removing bridging blocks of zeros. Fixing that requires a much greater overhaul. This is a step towards overhauling the fscache API. The change is opt-in on the part of the network filesystem. A netfs should not try to mix the old and the new API because of conflicting ways of handling pages and the PG_fscache page flag and because it would be mixing DIO with buffered I/O. Further, the helper library can't be used with the old API. This does not change any of the fscache cookie handling APIs or the way invalidation is done at this time. In the near term, I intend to deprecate and remove the old I/O API (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(), fscache_write_page() and fscache_uncache_page()) and eventually replace most of fscache/cachefiles with something simpler and easier to follow. This patchset contains the following parts: - Some helper patches, including provision of an ITER_XARRAY iov iterator and a function to do readahead expansion. - Patches to add the netfs helper library. - A patch to add the fscache/cachefiles kiocb API. - A pair of patches to fix some review issues in the ITER_XARRAY and read helpers as spotted by Al and Willy. Jeff Layton has patches to add support in Ceph for this that he intends for this merge window. I have a set of patches to support AFS that I will post a separate pull request for. With this, AFS without a cache passes all expected xfstests; with a cache, there's an extra failure, but that's also there before these patches. Fixing that probably requires a greater overhaul. Ceph also passes the expected tests. I also have patches in a separate branch to tidy up the handling of PG_fscache/PG_private_2 and their contribution to page refcounting in the core kernel here, but I haven't included them in this set and will route them separately" Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/ * tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: netfs: Miscellaneous fixes iov_iter: Four fixes for ITER_XARRAY fscache, cachefiles: Add alternate API to use kiocb for read/write to cache netfs: Add a tracepoint to log failures that would be otherwise unseen netfs: Define an interface to talk to a cache netfs: Add write_begin helper netfs: Gather stats netfs: Add tracepoints netfs: Provide readahead and readpage netfs helpers netfs, mm: Add set/end/wait_on_page_fscache() aliases netfs, mm: Move PG_fscache helper funcs to linux/netfs.h netfs: Documentation for helper library netfs: Make a netfs helper module mm: Implement readahead_control pageset expansion mm/readahead: Handle ractl nr_pages being modified fs: Document file_ra_state mm/filemap: Pass the file_ra_state in the ractl mm: Add set/end/wait functions for PG_private_2 iov_iter: Add ITER_XARRAY	2021-04-27 13:08:12 -07:00
Linus Torvalds	34a456eb1f	fs.idmapped.helpers.v5.13 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiiwAKCRCRxhvAZXjc ogtMAQC+MtgJZdcH5iDHNEyI36JaWUccKRV7PdvfF1YgnXO45gD+IYxR1c/EQQyD kh2AmqhET6jVhe9Nsob5yxduksI+ygo= =oh/d -----END PGP SIGNATURE----- Merge tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fs mapping helper updates from Christian Brauner: "This adds kernel-doc to all new idmapping helpers and improves their naming which was triggered by a discussion with some fs developers. Some of the names are based on suggestions by Vivek and Al. Also remove the open-coded permission checking in a few places with simple helpers. Overall this should lead to more clarity and make it easier to maintain" * tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: introduce two inode i_{u,g}id initialization helpers fs: introduce fsuidgid_has_mapping() helper fs: document and rename fsid helpers fs: document mapping helpers	2021-04-27 12:49:42 -07:00
Linus Torvalds	cc15422c1f	fs.idmapped.docs.v5.13 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiFwAKCRCRxhvAZXjc oswFAP4sL0oA7mBGDzoxktIMWKY+f7KKDjb9gXc8fDQV9bbcNwD6A9QPJCahfab9 cndByav/xcB/7n/NXLecNYr8NcfTgg8= =mdyh -----END PGP SIGNATURE----- Merge tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fs helper kernel-doc updates from Christian Brauner: "In the last cycles we forgot to update the kernel-docs in some places that were changed during the idmapped mount work. Lukas and Randy took the chance to not just fixup those places but also fixup and expand kernel-docs for some additional helpers. No functional changes" * tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: update kernel-doc for vfs_rename() fs: turn some comments into kernel-doc xattr: fix kernel-doc for mnt_userns and vfs xattr helpers namei: fix kernel-doc for struct renamedata and more libfs: fix kernel-doc for mnt_userns	2021-04-27 12:42:03 -07:00
Linus Torvalds	b34b95ebbb	New code for 5.13: - When a swap file is rejected, actually log the /name/ of the swapfile. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmBerWkACgkQ+H93GTRK tOu2zhAAm22CPqLH4/MJ8E3+DSCcZ1b/IKPemHk8lcJbPEmhoMyRxG6TFndo4p38 7IfsC1BChP9RYj+XwP3CWlya2c68cCu87RScDOShtFpxRQbSSv3HYsYrssfKLAXq Mn6GaIoLJu16gtQmw5kIbaPyJDlik+dyjNHowzv13hrYhjVqp97GgSiNq0uMMyJ5 7yfyWCig42V06FCVLVEatNKecI0gZy2n1ZxwMTIzqfcI+ItzYCHf6TGirKd1IVF2 kqjqEg2kQmft8pr0fAmSXxF23fRrjCFm8zhDnBcHWsPSAXmONXVEhicSotX60NrB ++KE6/eiNJNGZSdhhzGyIAYZsdlDHq0GS7OM0LVxp19/yc4kAYoi3UIrg0PBHkwt D6gGWqDpMuPwq++Ao9dWzM1tpXlmKGCC5Ju5Oqv1BIPcgMBJ6cd9ZVgCJgwn0+k8 y6wrlW1JDJqY5JcYfPCCd1Y7EPR+3lTtYhD39/UKDEg89lNXur9PEPDW+eN8jsqE EHJri5eQuGfprsmXp7qqOEur0BU5vzi9+O95b/EiJSQiHu52EtvbZYppllj1V9nj Nm/uz8Uq2BBehFOdSDZ3QlzlOXlronq74DCaK9RdRmWoipIH09cTkPoctGahH2uk OpE/9y1PIU+Kl1N2GU5W7rGud/jMFyYK1BVmHeIkvYqbLMNSXug= =9Oxg -----END PGP SIGNATURE----- Merge tag 'iomap-5.13-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap update from Darrick Wong: "A single patch to the iomap code, which augments what gets logged when someone tries to swapon an unacceptable swap file. (Yes, this is a continuation of the swapfile drama from last season...)" * tag 'iomap-5.13-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: improve the warnings from iomap_swapfile_activate	2021-04-27 12:27:23 -07:00
Linus Torvalds	a4f7fae101	Merge branch 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...	2021-04-27 11:18:24 -07:00
Linus Torvalds	5e67208885	Merge branch 'work.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull coredump updates from Al Viro: "Just a couple of patches this cycle: use of seek + write instead of expanding truncate and minor header cleanup" * 'work.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coredump.h: move CONFIG_COREDUMP-only stuff inside the ifdef coredump: don't bother with do_truncate()	2021-04-27 11:04:27 -07:00
Linus Torvalds	d1466bc583	Merge branch 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs inode type handling updates from Al Viro: "We should never change the type bits of ->i_mode or the method tables (->i_op and ->i_fop) of a live inode. Unfortunately, not all filesystems took care to prevent that" * 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: spufs: fix bogosity in S_ISGID handling 9p: missing chunk of "fs/9p: Don't update file type when updating file attributes" openpromfs: don't do unlock_new_inode() until the new inode is set up hostfs_mknod(): don't bother with init_special_inode() cifs: have cifs_fattr_to_inode() refuse to change type on live inode cifs: have ->mkdir() handle race with another client sanely do_cifs_create(): don't set ->i_mode of something we had not created gfs2: be careful with inode refresh ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode orangefs_inode_is_stale(): i_mode type bits do not form a bitmap... vboxsf: don't allow to change the inode type afs: Fix updating of i_mode due to 3rd party change ceph: don't allow type or device number to change on non-I_NEW inodes ceph: fix up error handling with snapdirs new helper: inode_wrong_type()	2021-04-27 10:57:42 -07:00
Linus Torvalds	57fa2369ab	CFI on arm64 series for v5.13-rc1 - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHCR8ACgkQiXL039xt wCZyFQ//fnUZaXR2K354zDyW6CJljMf+d94RF6rH+J6eMTH2/HXa5v0iJokwABLf ussP6qF4k5wtmI22Gm9A5Zc3e4iiry5pC0jOdk0mk4gzWwFN9MdgNxJZIGA3xqhS bsBK4AGrVKjtZl48G1/ZxJuNDeJhVp6GNK2n6/Gl4rZF6R7D/Upz0XelyJRdDpcM HIGma7jZl6xfGU0mdWCzpOGK1zdMca1WVs7A4YuurSbLn5PZJrcNVWLouDqt/Si2 AduSri1gyPClicgvqWjMOzhUpuw/nJtBLRl1x1EsWk/KSZ1/uNVjlewfzdN4fZrr zbtFr2gLubYLK6JOX7/LqoHlOTgE3tYLL+WIVN75DsOGZBKgHhmebTmWLyqzV0SL oqcyM5d3ucC6msdtAK5Fv4MSp8rpjqlK1Ha4SGRT6kC2wut7AhZ3KD7eyRIz8mV9 Sa9mhignGFJnTEUp+LSbYdrAudgSKxB40WyXPmswAXX4VJFRD4ONrrcAON/SzkUT Hw/JdFRCKkJjgwNQjIQoZcUNMTbFz2PlNIEnjJWm38YImQKQlCb2mXaZKCwBkf45 aheCZk17eKoxTCXFMd+KxlyNEtS2yBfq/PpZgvw7GW/pfFbWUg1+2O41LnihIe5v zu0hN1wNCQqgfxiMZqX1OTb9C/2vybzGsXILt+9nppjZ8EBU7iU= =wU6U -----END PGP SIGNATURE----- Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull CFI on arm64 support from Kees Cook: "This builds on last cycle's LTO work, and allows the arm64 kernels to be built with Clang's Control Flow Integrity feature. This feature has happily lived in Android kernels for almost 3 years[1], so I'm excited to have it ready for upstream. The wide diffstat is mainly due to the treewide fixing of mismatched list_sort prototypes. Other things in core kernel are to address various CFI corner cases. The largest code portion is the CFI runtime implementation itself (which will be shared by all architectures implementing support for CFI). The arm64 pieces are Acked by arm64 maintainers rather than coming through the arm64 tree since carrying this tree over there was going to be awkward. CFI support for x86 is still under development, but is pretty close. There are a handful of corner cases on x86 that need some improvements to Clang and objtool, but otherwise works well. Summary: - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)" * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: allow CONFIG_CFI_CLANG to be selected KVM: arm64: Disable CFI for nVHE arm64: ftrace: use function_nocfi for ftrace_call arm64: add __nocfi to __apply_alternatives arm64: add __nocfi to functions that jump to a physical address arm64: use function_nocfi with __pa_symbol arm64: implement function_nocfi psci: use function_nocfi for cpu_resume lkdtm: use function_nocfi treewide: Change list_sort to use const pointers bpf: disable CFI in dispatcher functions kallsyms: strip ThinLTO hashes from static functions kthread: use WARN_ON_FUNCTION_MISMATCH workqueue: use WARN_ON_FUNCTION_MISMATCH module: ensure __cfi_check alignment mm: add generic function_nocfi macro cfi: add __cficanonical add support for Clang CFI	2021-04-27 10:16:46 -07:00
Linus Torvalds	288321a9c6	pstore update for v5.13-rc1 - Add mem_type property to expand support for >2 memory types (Mukesh Ojha) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHBv0ACgkQiXL039xt wCYkdQ//f8yGzy0fDYXtoA6Nw+wBC9NE2b1iaZ7WIvTAamcSuay4QGQHXOZZO8tc vCJG1parx7aRwPg5zpeYODUTBC6aVxLdRd2Zk46Pahy8XOel+Rr60oqr3LKdPaEW xzh+rkRd4uXSI/MjvNuNviiLFSkp7qqb+M3R2Gkto8C3cSVi3OURSPdyvdrAiWan 29xwwNvxva2cgWvP8GZL9V63piWzqpJWW8Y8Mxc3TMQDqwNczH8xqxfPEoP+jhIV WyD+gDlRFxjjohs2pQFxrdrXzIydT9wPH9Qe0h9QmFIXpOLBHMRo56rFy6kbZ1fj kAGtbd6//MfVBSBWneNrhocEFEZDK8V/WHvCY1OV9iGgB38wvI290hJqwt92NEyL vMg8Bcn4TmlW3LekjQZWzqn0BqHLPVSmOW7mImaO0HA33EpcnwqWqJuWxuyVelAG TZ0s0DU30eaquYX26Qi39+EkM7N/YsFc1VD5ejaghNkZrxPpdibTk3BtgKtQN0gT 9m3E8r+LiVX5RI4H18aW8AjzCE7p/BKfHRqCKyq77era8BhUGEXXmjGIwYDRlRQA eS90PJUKBnloLVS88urMkVsaF5MjM8hKOgbc8crxiISoC+fd3pQT1dIwRFGbMsiG iJP7bLfkKU/HEaQhWQwXqiaXb/qDWqRL6nrjsrUfW1rnlJ6INfc= =j6qi -----END PGP SIGNATURE----- Merge tag 'pstore-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore update from Kees Cook: - Add mem_type property to expand support for >2 memory types (Mukesh Ojha) * tag 'pstore-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Add mem_type property DT parsing support	2021-04-27 10:08:10 -07:00
Linus Torvalds	7e4910b9ac	seccomp updates for v5.13-rc1 - Fix "cacheable" typo in comments (Cui GaoSheng) - Fix CONFIG for /proc/$pid/status Seccomp_filters (Kenta.Tada@sony.com) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHBe0ACgkQiXL039xt wCadDA//cy6LXlzJ78tRy1Zj4/iRlvfLGQ6rNuhoWkm9nuLOTJlzmb9lPxLFo1lo N4FDuXE0daPvmgy/XVu9wBKDZsgTlegzikGARfQmeHJ7Wj1H8ibz1OJPd1o60p4Y pfeImxefoNKxx7IxnNFDMLHgVi+CtnOZklwlj+bobIWjzclNB2EacumnyJlPuboW 4ZHBSkG1roLkBB4Q10fI7OHV8lSuQp/IyrAypLybydJ0xiZgvGD3NPOA4N8KH8nR A0kbA953Rld/PFzw5inRqepyPZKtT07LJyfl1ff60OtKOHkVBXPv6pYrdgWs0A9y XZxdHjVV/MHLvcK9dBoZZGi0/907fvcEgtMacaRekevD5sqiqtNOH5B5rQsMwtXs s/Kvg1KgmVJBQwFcMRuAfXqnnPy2672XvDU5/uptVbhpOIcIVeHtGvygPkiobuuO V1sE+huGCw+xnfRIOOmytRTpkHMlIS9ev1ApfXtuUtbXbM0W1G6H7adc0KE4bApm D/fpv97myH42r/UghOL5EHVaLcnw8embVr/ij4WpMiC1TrhWy0XU27oJisG6xRw6 A2Q4ybO3VM85LgteeQg10BZFmnuwfHMRJPBL4TOhNSs5GBx5EmkEFByozvMst5xR W/GIDn7g7jy1H0wuQOQ7NCgU5+RDDslCOjCIJdSipwpsTc65QCQ= =m4xO -----END PGP SIGNATURE----- Merge tag 'seccomp-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - Fix "cacheable" typo in comments (Cui GaoSheng) - Fix CONFIG for /proc/$pid/status Seccomp_filters (Kenta.Tada@sony.com) * tag 'seccomp-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: Fix "cacheable" typo in comments seccomp: Fix CONFIG tests for Seccomp_filters	2021-04-27 10:03:12 -07:00
Hao Xu	7b289c3833	io_uring: maintain drain logic for multishot poll requests Now that we have multishot poll requests, one SQE can emit multiple CQEs. given below example: sqe0(multishot poll)-->sqe1-->sqe2(drain req) sqe2 is designed to issue after sqe0 and sqe1 completed, but since sqe0 is a multishot poll request, sqe2 may be issued after sqe0's event triggered twice before sqe1 completed. This isn't what users leverage drain requests for. Here the solution is to wait for multishot poll requests fully completed. To achieve this, we should reconsider the req_need_defer equation, the original one is: all_sqes(excluding dropped ones) == all_cqes(including dropped ones) This means we issue a drain request when all the previous submitted SQEs have generated their CQEs. Now we should consider multishot requests, we deduct all the multishot CQEs except the cancellation one, In this way a multishot poll request behave like a normal request, so: all_sqes == all_cqes - multishot_cqes(except cancellations) Here we introduce cq_extra for it. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1618298439-136286-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-27 07:38:58 -06:00
Palash Oswal	6d042ffb59	io_uring: Check current->io_uring in io_uring_cancel_sqpoll syzkaller identified KASAN: null-ptr-deref Write in io_uring_cancel_sqpoll. io_uring_cancel_sqpoll is called by io_sq_thread before calling io_uring_alloc_task_context. This leads to current->io_uring being NULL. io_uring_cancel_sqpoll should not have to deal with threads where current->io_uring is NULL. In order to cast a wider safety net, perform input sanitisation directly in io_uring_cancel_sqpoll and return for NULL value of current->io_uring. This is safe since if current->io_uring isn't set, then there's no way for the task to have submitted any requests. Reported-by: syzbot+be51ca5a4d97f017cd50@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Palash Oswal <hello@oswalpalash.com> Link: https://lore.kernel.org/r/20210427125148.21816-1-hello@oswalpalash.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-27 07:37:39 -06:00
Hyeongseok Kim	c6e2f52e30	exfat: speed up iterate/lookup by fixing start point of traversing cluster chain When directory iterate and lookup is called, there's a buggy rewinding of start point for traversing cluster chain to the parent directory entry's first cluster. This caused repeated cluster chain traversing from the first entry of the parent directory that would show worse performance if huge amounts of files exist under the parent directory. Fix not to rewind, make continue from currently referenced cluster and dir entry. Tested with 50,000 files under single directory / 256GB sdcard, with command "time ls -l > /dev/null", Before : 0m08.69s real 0m00.27s user 0m05.91s system After : 0m07.01s real 0m00.25s user 0m04.34s system Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2021-04-27 20:45:07 +09:00
Hyeongseok Kim	23befe490b	exfat: improve write performance when dirsync enabled Degradation of write speed caused by frequent disk access for cluster bitmap update on every cluster allocation could be improved by selective syncing bitmap buffer. Change to flush bitmap buffer only for the directory related operations. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2021-04-27 20:45:06 +09:00
Hyeongseok Kim	654762df2e	exfat: add support ioctl and FITRIM function Add FITRIM ioctl to enable discarding unused blocks while mounted. As current exFAT doesn't have generic ioctl handler, add empty ioctl function first, and add FITRIM handler. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2021-04-27 20:45:06 +09:00
Hyeongseok Kim	5c2d728507	exfat: introduce bitmap_lock for cluster bitmap access s_lock which is for protecting concurrent access of file operations is too huge for cluster bitmap protection, so introduce a new bitmap_lock to narrow the lock range if only need to access cluster bitmap. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2021-04-27 20:45:06 +09:00
Hyeongseok Kim	77edfc6e51	exfat: fix erroneous discard when clear cluster bit If mounted with discard option, exFAT issues discard command when clear cluster bit to remove file. But the input parameter of cluster-to-sector calculation is abnormally added by reserved cluster size which is 2, leading to discard unrelated sectors included in target+2 cluster. With fixing this, remove the wrong comments in set/clear/find bitmap functions. Fixes: `1e49a94cf7` ("exfat: add bitmap operations") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2021-04-27 20:45:06 +09:00
David Howells	53b776c77a	netfs: Miscellaneous fixes Fix some miscellaneous things in the new netfs lib[1]: (1) The kerneldoc for netfs_readpage() shouldn't say netfs_page(). (2) netfs_readpage() can get an integer overflow on 32-bit when it multiplies page_index(page) by PAGE_SIZE. It should use page_file_offset() instead. (3) netfs_write_begin() should use page_offset() to avoid the same overflow. Note that netfs_readpage() needs to use page_file_offset() rather than page_offset() as it may see swap-over-NFS. Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/161789062190.6155.12711584466338493050.stgit@warthog.procyon.org.uk/ [1]	2021-04-26 23:23:41 +01:00
Linus Torvalds	55ba0fe059	for-5.13-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCHGMYACgkQxWXV+ddt WDsFeA/+MZ+5UiYYucH5RVw/VExOQzSvlRVxxnOeR2s8V/gFj/Ip7d9E9UezA+lX di2byKXPzfjL9+xoyqfEZLpPgDQrhHTIQCzuNSvzMBykJIL6Sf1OZTtUZWU3HDH7 S51UZtghgTPzeOhxsiBHqSFo9danT0w3KQhliE10Ur855ziKSvL2Tb7dvM6q5TS7 mFTAj/Y2aanDaFKQjjBzzA+GZ0LFIGuErg1PADmF5XjbyY06ho1xqQ1A0t2/XL9x UpHdRP3E5XRAMl4uyYOUtbvUB1cROzoS6ySHJOJ9Bbz+IC0cLf5xTJkLE25bGkSi GjFNvnQOha1s8oMIlqkw64hKQqwp+gu2iZ7m1o76Z31k7CpLAC+rg11gbpODRuoh 7B3EzowKyVihMHF8URAdC3A+9gbpPvyuGKDSy07yULh/2vas6dEzR3cPVEU0yXyJ 3DO1ds0lVY3B/T9LKPQ785hQ7VdpgZ8BdIOVRtjgV2QQEa9eFh9VzybQjU8yBXGd vflBe8kQfASIZ5E0rcUGPUVIJoesM8U1pSlx9jvvTQVkOC/DQjtBx/5ePCL2iVfd izY8uWlCdguF/P1CYFf1M0auASSzl3bip1NnSMVvZ90dgDEK4XaIyd16kMGDCbU2 UMOePMsLDApWcCVTqM/J+lFLa7rajRccdKby7F/zSpZIRgadPF8= =J5jh -----END PGP SIGNATURE----- Merge tag 'for-5.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The updates this time are mostly stabilization, preparation and minor improvements. User visible improvements: - readahead for send, improving run time of full send by 10% and for incremental by 25% - make reflinks respect O_SYNC, O_DSYNC and S_SYNC flags - export supported sectorsize values in sysfs (currently only page size, more once full subpage support lands) - more graceful errors and warnings on 32bit systems when logical addresses for metadata reach the limit posed by unsigned long in page::index - error: fail mount if there's a metadata block beyond the limit - error: new metadata block would be at unreachable address - warn when 5/8th of the limit is reached, for 4K page systems it's 10T, for 64K page it's 160T - zoned mode - relocated zones get reset at the end instead of discard - automatic background reclaim of zones that have 75%+ of unusable space, the threshold is tunable in sysfs Fixes: - fsync and tree mod log fixes - fix inefficient preemptive reclaim calculations - fix exhaustion of the system chunk array due to concurrent allocations - fix fallback to no compression when racing with remount - preemptive fix for dm-crypt on zoned device that does not properly advertise zoned support Core changes: - add inode lock to synchronize mmap and other block updates (eg. deduplication, fallocate, fsync) - kmap conversions to new kmap_local API - subpage support (continued) - new helpers for page state/extent buffer tracking - metadata changes now support read and write - error handling through out relocation call paths - many other cleanups and code simplifications" * tag 'for-5.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (112 commits) btrfs: zoned: automatically reclaim zones btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock btrfs: zoned: reset zones of relocated block groups btrfs: more graceful errors/warnings on 32bit systems when reaching limits btrfs: zoned: fix unpaired block group unfreeze during device replace btrfs: fix race when picking most recent mod log operation for an old root btrfs: fix metadata extent leak after failure to create subvolume btrfs: handle remount to no compress during compression btrfs: zoned: fail mount if the device does not support zone append btrfs: fix race between transaction aborts and fsyncs leading to use-after-free btrfs: introduce submit_eb_subpage() to submit a subpage metadata page btrfs: make lock_extent_buffer_for_io() to be subpage compatible btrfs: introduce write_one_subpage_eb() function btrfs: introduce end_bio_subpage_eb_writepage() function btrfs: check return value of btrfs_commit_transaction in relocation btrfs: do proper error handling in merge_reloc_roots btrfs: handle extent corruption with select_one_root properly btrfs: cleanup error handling in prepare_to_merge btrfs: do not panic in __add_reloc_root btrfs: handle __add_reloc_root failures in btrfs_recover_relocation ...	2021-04-26 13:48:02 -07:00
Linus Torvalds	2a19866b6e	40 cifs/smb3 changesets, including 4 for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCHFqEACgkQiiy9cAdy T1FLpQv9F3JpIUOCODcrUe7wn51R7Of38QU6YP1nu4df0SN40KXjIBiweTKF3W7v qtLHNtHCjmxiirlCRpmLO7rCwf59Y31UweQTQa47XosgZZWKfp8PhsUuRib2WDLa /Sw4SBlmVktg4hutxvTiWYW2AbAMZhfTP1SHt61BXItUwt9c2boTBfsqpF0PbYh/ 39AOlFgJ6HVYS4G1YLTQ29kNU/qNkFtY9M4PH4Km68tyL5MP14s35Ee1+cIxtuFT 0FGYXl+D3RhXxU/jN8YfdRBfG8AoA0xFEr59HhJnmaoCd9fa8f3PNBn1pX4KcPuS EhvymK6FfFZk25Y/bcrZhsbJ8eNtx3o7dy7MFNaKeHneFnVvKOAQTtqrRQrj+O2K byV1KgXy0SDaJTKMzPSZTldf42ZvQhCMS0Rd5cRt226gly0CVwMAMqwb8M/SaGVJ dus00y05XybVVApsj8/ZizZgtXTJDyFyfcc2MMx3aNIkg6sB74TwH1XE37U7al8n y64KPcMw =6NqG -----END PGP SIGNATURE----- Merge tag '5.12-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs updates from Steve French: - improvements to root directory metadata caching - addition of new "rasize" mount parameter which can significantly increase read ahead performance (e.g. copy can be much faster, especially with multichannel) - addition of support for insert and collapse range - improvements to error handling in mount * tag '5.12-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (40 commits) cifs: update internal version number smb3: add rasize mount parameter to improve readahead performance smb3: limit noisy error cifs: fix leak in cifs_smb3_do_mount() ctx cifs: remove unnecessary copies of tcon->crfid.fid cifs: Return correct error code from smb2_get_enc_key cifs: fix out-of-bound memory access when calling smb3_notify() at mount point smb2: fix use-after-free in smb2_ioctl_query_info() cifs: export supported mount options via new mount_params /proc file cifs: log mount errors using cifs_errorf() cifs: add fs_context param to parsing helpers cifs: make fs_context error logging wrapper cifs: add FALLOC_FL_INSERT_RANGE support cifs: add support for FALLOC_FL_COLLAPSE_RANGE cifs: check the timestamp for the cached dirent when deciding on revalidate cifs: pass the dentry instead of the inode down to the revalidation check functions cifs: add a timestamp to track when the lease of the cached dir was taken cifs: add a function to get a cached dir based on its dentry cifs: Grab a reference for the dentry of the cached directory during the lifetime of the cache cifs: store a pointer to the root dentry in cifs_sb_info once we have completed mounting the share ...	2021-04-26 13:41:30 -07:00
Linus Torvalds	c065c42966	Highlights: - Update NFSv2 and NFSv3 XDR encoding functions - Add batch Receive posting to the server's RPC/RDMA transport (take 2) - Reduce page allocator traffic in svcrdma -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmBzLCIACgkQM2qzM29m f5evWA//fE2WlZDoTP8Iq1BGreGrGyzqOIkJakDGoZs4VOaUJN9WxWEcmBHI4t22 yom7aZ7S7VtMF6SoGMnohYoNwkloPJ1kfYBVZuUxDUCIHGVrLaAGZwjtojQftUS0 19ZdiSx8D8xWEqI/cpbHsj+CNCH1F5IDGjJzNhlz+rIFLrRPBMeHwbY8qf/zusm/ uZ4tGJtKWmFXvdT9duGkoNXRd0gBdcfDeFN//JYrLPS9sX4zs4/2KnDh25YJR8Jf EQryc19l4ztlVT8PXJfI4I/fG0Sfv5AWuYYzaFIncp1PkmiunqGL1yai6eeW4LEN 8v3QEUNy9J7J20FrsL1ge1icbEyObfNFvkgYqNhmGIBdbEDdLWPppL6fQKUYl/zi HSnAoJsJYyzYKbE7BMLy3wwZ775GsTrkU/7tfiu/M9KgvXJtDy0m7vsxt3/qULXn Bg4KAKnmYIcuigdG6BATJ23jISRK7cHH/YYlORj3lsB/KQi+c2U/5zQFaROOwbtc Ny+y92zQZVu9ZMkUfs2+b5qdg8Z/J0p8n9MfS7GcpCIyGZTnvfs8SfM8BXuHH2Yn BdL3xDoPlbYQnAzp16oVfdoYm8RWzFBv+xHc360ielMOQbP4ntdC0oexIApsXzyz Yv4OK9itsLagwcuAxPWnfnIgGxj9hPsp/peUsChsglhx+4NSGWA= =BOO0 -----END PGP SIGNATURE----- Merge tag 'nfsd-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: "Highlights: - Update NFSv2 and NFSv3 XDR encoding functions - Add batch Receive posting to the server's RPC/RDMA transport (take 2) - Reduce page allocator traffic in svcrdma" * tag 'nfsd-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (70 commits) NFSD: Use DEFINE_SPINLOCK() for spinlock sunrpc: Remove unused function ip_map_lookup NFSv4.2: fix copy stateid copying for the async copy UAPI: nfsfh.h: Replace one-element array with flexible-array member svcrdma: Clean up dto_q critical section in svc_rdma_recvfrom() svcrdma: Remove svc_rdma_recv_ctxt::rc_pages and ::rc_arg svcrdma: Remove sc_read_complete_q svcrdma: Single-stage RDMA Read SUNRPC: Move svc_xprt_received() call sites SUNRPC: Export svc_xprt_received() svcrdma: Retain the page backing rq_res.head[0].iov_base svcrdma: Remove unused sc_pages field svcrdma: Normalize Send page handling svcrdma: Add a "deferred close" helper svcrdma: Maintain a Receive water mark svcrdma: Use svc_rdma_refresh_recvs() in wc_receive svcrdma: Add a batch Receive posting mechanism svcrdma: Remove stale comment for svc_rdma_wc_receive() svcrdma: Provide an explanatory comment in CMA event handler svcrdma: RPCDBG_FACILITY is no longer used ...	2021-04-26 13:34:32 -07:00
Linus Torvalds	b5b3097d9c	Changes since last update: - avoid memory failure when applying rolling decompression; - optimize endio decompression logic for non-atomic contexts; - complete a missing case which can be safely selected for inplace I/O and thus decreasing more memory footprint; - check unsupported on-disk inode i_format strictly; - support adjustable lz4 sliding window size to decrease runtime memory footprint; - support on-disk compression configurations; - support big pcluster decompression; - several code cleanups / spelling correction. -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYIZfvhUcaHNpYW5na2Fv QHJlZGhhdC5jb20ACgkQOTcx3B+15gRi3gD6A2+hqDgBIASDRhgJvcG8IXyCSNSi RnIykjj1PTXPtNgA/R26f2YGUP04v343tuK7Wm6voKzSVW4Uud2DwhSlXPIE =9bB5 -----END PGP SIGNATURE----- Merge tag 'erofs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, we would like to introduce a new feature called big pcluster so EROFS can compress file data into more than 1 fs block and different pcluster size can be selected for each (sub-)files by design. The current EROFS test results on my laptop are [1]: Testscript: erofs-openbenchmark Testdata: enwik9 (1000000000 bytes) ________________________________________________________________ \| file system \| size \| seq read \| rand read \| rand9m read \| \|_______________\|___________\|_ MiB/s __\|__ MiB/s __\|___ MiB/s ___\| \|___erofs_4k____\|_556879872_\|_ 781.4 __\|__ 55.3 ___\|___ 25.3 ___\| \|___erofs_16k___\|_452509696_\|_ 864.8 __\|_ 123.2 ___\|___ 20.8 ___\| \|___erofs_32k___\|_415223808_\|_ 899.8 __\|_ 105.8 __\|___ 16.8 ____\| \|___erofs_64k___\|_393814016_\|_ 906.6 __\|__ 66.6 __\|___ 11.8 ____\| \|__squashfs_8k__\|_556191744_\|_ 64.9 __\|__ 19.3 ___\|____ 9.1 ____\| \|__squashfs_16k_\|_502661120_\|_ 98.9 __\|__ 38.0 ___\|____ 9.8 ____\| \|__squashfs_32k_\|_458784768_\|_ 115.4 __\|__ 71.6 __\|___ 10.0 ____\| \|_squashfs_128k_\|_398204928_\|_ 257.2 __\|_ 253.8 __\|___ 10.9 ____\| \|____ext4_4k____\|____()_____\|_ 786.6 __\|__ 28.6 ___\|___ 27.8 ____\| which has been verified but I'd like warn it as experimental for a while. This matches erofs-utils dev branch and I'll also release a new userspace version for this later. Apart from that, several improvements are also included: eg complete a missing case for inplace I/O, optimize endio decompression logic for non-atomic contexts and support adjustable sliding window size, ... In addition to those, there are some cleanups as always. Summary: - avoid memory failure when applying rolling decompression - optimize endio decompression logic for non-atomic contexts - complete a missing case which can be safely selected for inplace I/O and thus decreasing more memory footprint - check unsupported on-disk inode i_format strictly - support adjustable lz4 sliding window size to decrease runtime memory footprint - support on-disk compression configurations - support big pcluster decompression - several code cleanups / spelling correction" * tag 'erofs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (21 commits) erofs: enable big pcluster feature erofs: support decompress big pcluster for lz4 backend erofs: support parsing big pcluster compact indexes erofs: support parsing big pcluster compress indexes erofs: adjust per-CPU buffers according to max_pclusterblks erofs: add big physical cluster definition erofs: fix up inplace I/O pointer for big pcluster erofs: introduce physical cluster slab pools erofs: introduce multipage per-CPU buffers erofs: reserve physical_clusterbits[] erofs: Clean up spelling mistakes found in fs/erofs erofs: add on-disk compression configurations erofs: introduce on-disk lz4 fs configurations erofs: support adjust lz4 history window size erofs: introduce erofs_sb_has_xxx() helpers erofs: add unsupported inode i_format check erofs: don't use erofs_map_blocks() any more erofs: complete a missing case for inplace I/O erofs: use sync decompression for atomic contexts only erofs: use workqueue decompression for atomic contexts only ...	2021-04-26 13:28:12 -07:00
Linus Torvalds	befbfe07e6	File locking fixes for v5.13 -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmCGnowTHGpsYXl0b25A a2VybmVsLm9yZwAKCRAADmhBGVaCFcKnEAC0MjXWbAvisoEMQDej+1FrqJSvUuMb kYGyjWQxLoQwb2Yj4FAjOwg0PtCq5r29CtgKvVjr4Dq2RpVzslG1Yt7ql6eRta8k rA2tjU1qosYLgMrj7PkItLC+rvFKZeF3X54SFFrLCjuu6/rMZH2v3d3C6oUsruba mWOdkX0Q2vApGfn7ooFOIe3UE29IG1p/6azCfcjjVUi19ibCFyxhxN4IU0nU+x4+ 86KIDwud7iijY/pBcHs1g6F9lD4TyA/XKqXgonC71rtqD7zlZWwRhugNaKmCqK12 2CskoxFpuVeFtI/PLe/mf9q1aVElZppa2fKQhIrWey3L7dVdU583kbhIiSlo5mvC 0jFy8r1+JcWfKB+HGjSFQQvG3FkST+ZZ6+eVlOoY5Wdxc/kzlQLBSBrWkWDtsjvm +oCmhX9T0ecwUH+AWEr27WP8eSsidSjHJAZY6DGuSwSZig9qEOo9Ayc7qTj3lB/I KGL8z8d+x27jXnNMG2+b8acYNC/dhMyIb5Z69/qPptvThteUne/WvTMU14eRCvqm C7R1QpQRvgtGiJl8PWkzjxUoKI2XktSL+arbRsqIP3mxlJ6pZJyJpaxDMYTcfz9D sWzapnORBKXxvK2xcuXip8v9w3yqgONA8KE5xQrTL4aCg16bXJVXI4c9nN4frNBD z2DRhnGw6nXoSQ== =88Qz -----END PGP SIGNATURE----- Merge tag 'locks-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "When we reworked the blocked locks into a tree structure instead of a flat list a few releases ago, we lost the ability to see all of the file locks in /proc/locks. Luo's patch fixes it to dump out all of the blocked locks instead, which restores the full output. This changes the format of /proc/locks as the blocked locks are shown at multiple levels of indentation now, but lslocks (the only common program I've ID'ed that scrapes this info) seems to be OK with that. Tian also contributed a small patch to remove a useless assignment" * tag 'locks-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs/locks: remove useless assignment in fcntl_getlk fs/locks: print full locks information	2021-04-26 13:24:39 -07:00
Linus Torvalds	2f9ef0559e	It's been a relatively busy cycle in docsland, though more than usually well contained to Documentation/ itself. Highlights include: - The Chinese translators have been busy and show no signs of stopping anytime soon. Italian has also caught up. - Aditya Srivastava has been working on improvements to the kernel-doc script. - Thorsten continues his work on reporting-issues.rst and related documentation around regression reporting. - Lots of documentation updates, typo fixes, etc. as usual -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmCG5moPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YCoUH/1q/O+IvS+JNkxneDxbB6OC799BQpabZHi7/ HbYfgfX0nKrV3NAwIhigsIj6WHRE+5p2rKiHOuQxL3daJyfZSqQl0/yI0Ag7Of4g 7y1FKBQrfqS6tJcyNckdtBfxYUQP9yCJY0xfIexkTNiujbmkMKDSJD7lKXd0AaTM styCvTbgTPTzadL5bIHj/GxJ9s8DsxO3y9LGdRc+GrNzPFliMYWlJgbR28zjEKBm UQzy7JGNBX3qTJwgjvv/myqRDy6MligvGrP+wG0KTnAHXKkvDFl3p46kPwzdk1JE +F5sbboUWh20GLYy9t4MZOcq38FUcEPlRPXkxsGNyA8co5ij8+g= =7db3 -----END PGP SIGNATURE----- Merge tag 'docs-5.13' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "It's been a relatively busy cycle in docsland, though more than usually well contained to Documentation/ itself. Highlights include: - The Chinese translators have been busy and show no signs of stopping anytime soon. Italian has also caught up. - Aditya Srivastava has been working on improvements to the kernel-doc script. - Thorsten continues his work on reporting-issues.rst and related documentation around regression reporting. - Lots of documentation updates, typo fixes, etc. as usual" * tag 'docs-5.13' of git://git.lwn.net/linux: (139 commits) docs/zh_CN: add openrisc translation to zh_CN index docs/zh_CN: add openrisc index.rst translation docs/zh_CN: add openrisc todo.rst translation docs/zh_CN: add openrisc openrisc_port.rst translation docs/zh_CN: add core api translation to zh_CN index docs/zh_CN: add core-api index.rst translation docs/zh_CN: add core-api irq index.rst translation docs/zh_CN: add core-api irq irqflags-tracing.rst translation docs/zh_CN: add core-api irq irq-domain.rst translation docs/zh_CN: add core-api irq irq-affinity.rst translation docs/zh_CN: add core-api irq concepts.rst translation docs: sphinx-pre-install: don't barf on beta Sphinx releases scripts: kernel-doc: improve parsing for kernel-doc comments syntax docs/zh_CN: two minor fixes in zh_CN/doc-guide/ Documentation: dev-tools: Add Testing Overview docs/zh_CN: add translations in zh_CN/dev-tools/gcov docs: reporting-issues: make people CC the regressions list MAINTAINERS: add regressions mailing list doc:it_IT: align Italian documentation docs/zh_CN: sync reporting-issues.rst ...	2021-04-26 13:22:43 -07:00
Linus Torvalds	c01c0716cc	Driver core changes for 5.13-rc1 Here is the "big" set of driver core changes for 5.13-rc1. Nothing major, just lots of little core changes and cleanups, notable things are: - finally set fw_devlink=on by default. All reported issues with this have been shaken out over the past 9 months or so, but we will be paying attention to any fallout here in case we need to revert this as the default boot value (symptoms of problems are a simple lack of booting) - fixes found to be needed by fw_devlink=on value in some subsystems (like clock). - delayed work initialization cleanup - driver core cleanups and minor updates - software node cleanups and tweaks - devtmpfs cleanups - minor debugfs cleanups All of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYIazPA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylzUwCguQ+VUs1d0voq/oKiqR+lbXnQf3kAn0jf/eom ucRSdeIc21eEE83Ei9aZ =pchl -----END PGP SIGNATURE----- Merge tag 'driver-core-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core changes for 5.13-rc1. Nothing major, just lots of little core changes and cleanups, notable things are: - finally set 'fw_devlink=on' by default. All reported issues with this have been shaken out over the past 9 months or so, but we will be paying attention to any fallout here in case we need to revert this as the default boot value (symptoms of problems are a simple lack of booting) - fixes found to be needed by fw_devlink=on value in some subsystems (like clock). - delayed work initialization cleanup - driver core cleanups and minor updates - software node cleanups and tweaks - devtmpfs cleanups - minor debugfs cleanups All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (53 commits) devm-helpers: Fix devm_delayed_work_autocancel() kerneldoc PM / wakeup: use dev_set_name() directly software node: Allow node addition to already existing device kunit: software node: adhear to KUNIT formatting standard node: fix device cleanups in error handling code kobject_uevent: remove warning in init_uevent_argv() debugfs: Make debugfs_allow RO after init Revert "driver core: platform: Make platform_get_irq_optional() optional" media: ipu3-cio2: Switch to use SOFTWARE_NODE_REFERENCE() software node: Introduce SOFTWARE_NODE_REFERENCE() helper macro software node: Imply kobj_to_swnode() to be no-op software node: Deduplicate code in fwnode_create_software_node() software node: Introduce software_node_alloc()/software_node_free() software node: Free resources explicitly when swnode_register() fails debugfs: drop pointless nul-termination in debugfs_read_file_bool() driver core: add helper for deferred probe reason setting driver core: Improve fw_devlink & deferred_probe_timeout interaction of: property: fw_devlink: Add support for remote-endpoint driver core: platform: Make platform_get_irq_optional() optional driver core: Replace printf() specifier and drop unneeded casting ...	2021-04-26 11:05:36 -07:00
Chao Yu	9557727876	f2fs: drop inplace IO if fs status is abnormal If filesystem has cp_error or need_fsck status, let's drop inplace IO to avoid further corruption of fs data. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-04-26 09:50:39 -07:00
Chao Yu	8af85f712f	f2fs: compress: remove unneed check condition In only call path of __cluster_may_compress(), __f2fs_write_data_pages() has checked SBI_POR_DOING condition, and also cluster_may_compress() has checked CP_ERROR_FLAG condition, so remove redundant check condition in __cluster_may_compress() for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-04-26 09:50:36 -07:00
Linus Torvalds	a4a78bc8ea	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - crypto_destroy_tfm now ignores errors as well as NULL pointers Algorithms: - Add explicit curve IDs in ECDH algorithm names - Add NIST P384 curve parameters - Add ECDSA Drivers: - Add support for Green Sardine in ccp - Add ecdh/curve25519 to hisilicon/hpre - Add support for AM64 in sa2ul" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (184 commits) fsverity: relax build time dependency on CRYPTO_SHA256 fscrypt: relax Kconfig dependencies for crypto API algorithms crypto: camellia - drop duplicate "depends on CRYPTO" crypto: s5p-sss - consistently use local 'dev' variable in probe() crypto: s5p-sss - remove unneeded local variable initialization crypto: s5p-sss - simplify getting of_device_id match data ccp: ccp - add support for Green Sardine crypto: ccp - Make ccp_dev_suspend and ccp_dev_resume void functions crypto: octeontx2 - add support for OcteonTX2 98xx CPT block. crypto: chelsio/chcr - Remove useless MODULE_VERSION crypto: ux500/cryp - Remove duplicate argument crypto: chelsio - remove unused function crypto: sa2ul - Add support for AM64 crypto: sa2ul - Support for per channel coherency dt-bindings: crypto: ti,sa2ul: Add new compatible for AM64 crypto: hisilicon - enable new error types for QM crypto: hisilicon - add new error type for SEC crypto: hisilicon - support new error types for ZIP crypto: hisilicon - dynamic configuration 'err_info' crypto: doc - fix kernel-doc notation in chacha.c and af_alg.c ...	2021-04-26 08:51:23 -07:00
Pavel Begunkov	0b8c0e7c96	io_uring: fix NULL reg-buffer io_import_fixed() doesn't expect a registered buffer slot to be NULL and would fail stumbling on it. We don't allow it, but if during __io_sqe_buffers_update() rsrc removal succeeds but following register fails, we'll get such a situation. Do it atomically and don't remove buffers until we sure that a new one can be set. Fixes: `634d00df5e` ("io_uring: add full-fledged dynamic buffers support") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/830020f9c387acddd51962a3123b5566571b8c6d.1619446608.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 09:03:57 -06:00
Dai Ngo	d9092b4bb2	NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code. The client SSC code should not depend on any of the CONFIG_NFSD config. This patch removes all CONFIG_NFSD from NFSv4.2 client SSC code and simplifies the config of CONFIG_NFS_V4_2_SSC_HELPER, NFSD_V4_2_INTER_SSC. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-04-26 09:29:17 -04:00
Pavel Begunkov	9f59a9d88d	io_uring: simplify SQPOLL cancellations All sqpoll rings (even sharing sqpoll task) are currently dead bound to the task that created them, iow when owner task dies it kills all its SQPOLL rings and their inflight requests via task_work infra. It's neither the nicist way nor the most convenient as adds extra locking/waiting and dependencies. Leave it alone and rely on SIGKILL being delivered on its thread group exit, so there are only two cases left: 1) thread group is dying, so sqpoll task gets a signal and exit itself cancelling all requests. 2) an sqpoll ring is dying. Because refs_kill() is called the sqpoll not going to submit any new request, and that's what we need. And io_ring_exit_work() will do all the cancellation itself before actually killing ctx, so sqpoll doesn't need to worry about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3cd7f166b9c326a2c932b70e71a655b03257b366.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:59:25 -06:00
Pavel Begunkov	28090c1338	io_uring: fix work_exit sqpoll cancellations After closing an SQPOLL ring, io_ring_exit_work() kicks in and starts doing cancellations via io_uring_try_cancel_requests(). It will go through io_uring_try_cancel_iowq(), which uses ctx->tctx_list, but as SQPOLL task don't have a ctx note, its io-wq won't be reachable and so is left not cancelled. It will eventually cancelled when one of the tasks dies, but if a thread group survives for long and changes rings, it will spawn lots of unreclaimed resources and live locked works. Cancel SQPOLL task's io-wq separately in io_ring_exit_work(). Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a71a7fe345135d684025bb529d5cb1d8d6b46e10.1619389911.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:59:25 -06:00
Colin Ian King	615cee49b3	io_uring: Fix uninitialized variable up.resv The variable up.resv is not initialized and is being checking for a non-zero value in the call to _io_register_rsrc_update. Fix this by explicitly setting the variable to 0. Addresses-Coverity: ("Uninitialized scalar variable)" Fixes: `c3bdad0271` ("io_uring: add generic rsrc update with tags") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20210426094735.8320-1-colin.king@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:51:09 -06:00
Pavel Begunkov	a2b4198cab	io_uring: fix invalid error check after malloc Now we allocate io_mapped_ubuf instead of bvec, so we clearly have to check its address after allocation. Fixes: `41edf1a5ec` ("io_uring: keep table of pointers to ubufs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d28eb1bc4384284f69dbce35b9f70c115ff6176f.1619392565.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-26 06:50:35 -06:00
Steve French	a8a6082d4a	cifs: update internal version number To 2.32 Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 23:59:27 -05:00
Steve French	b8d64f8ced	smb3: add rasize mount parameter to improve readahead performance In some cases readahead of more than the read size can help (to allow parallel i/o of read ahead which can improve performance). Ceph introduced a mount parameter "rasize" to allow controlling this. Add mount parameter "rasize" to allow control of amount of readahead requested of the server. If rasize not set, rasize defaults to negotiated rsize as before. Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 23:59:08 -05:00
Steve French	423333bcba	smb3: limit noisy error For servers which don't support copy_range (SMB3 CopyChunk), the logging of: CIFS: VFS: \\server\share refcpy ioctl error -95 getting resume key can fill the client logs and make debugging real problems more difficult. Change the -EOPNOTSUPP on copy_range to a "warn once" Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
David Disseldorp	315db9a05b	cifs: fix leak in cifs_smb3_do_mount() ctx cifs_smb3_do_mount() calls smb3_fs_context_dup() and then cifs_setup_volume_info(). The latter's subsequent smb3_parse_devname() call overwrites the cifs_sb->ctx->UNC string already dup'ed by smb3_fs_context_dup(), resulting in a leak. E.g. unreferenced object 0xffff888002980420 (size 32): comm "mount", pid 160, jiffies 4294892541 (age 30.416s) hex dump (first 32 bytes): 5c 5c 31 39 32 2e 31 36 38 2e 31 37 34 2e 31 30 \\192.168.174.10 34 5c 72 61 70 69 64 6f 2d 73 68 61 72 65 00 00 4\rapido-share.. backtrace: [<00000000069e12f6>] kstrdup+0x28/0x50 [<00000000b61f4032>] smb3_fs_context_dup+0x127/0x1d0 [cifs] [<00000000c6e3e3bf>] cifs_smb3_do_mount+0x77/0x660 [cifs] [<0000000063467a6b>] smb3_get_tree+0xdf/0x220 [cifs] [<00000000716f731e>] vfs_get_tree+0x1b/0x90 [<00000000491d3892>] path_mount+0x62a/0x910 [<0000000046b2e774>] do_mount+0x50/0x70 [<00000000ca7b64dd>] __x64_sys_mount+0x81/0xd0 [<00000000b5122496>] do_syscall_64+0x33/0x40 [<000000002dd397af>] entry_SYSCALL_64_after_hwframe+0x44/0xae This change is a bandaid until the cifs_setup_volume_info() TODO and error handling issues are resolved. Signed-off-by: David Disseldorp <ddiss@suse.de> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> CC: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Muhammad Usama Anjum	ad7567bc65	cifs: remove unnecessary copies of tcon->crfid.fid pfid is being set to tcon->crfid.fid and they are copied in each other multiple times. Remove the memcopy between same pointers - memory locations. Addresses-Coverity: ("Overlapped copy") Fixes: `9e81e8ff74` ("cifs: return cached_fid from open_shroot") Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Paul Aurich	83728cbf36	cifs: Return correct error code from smb2_get_enc_key Avoid a warning if the error percolates back up: [440700.376476] CIFS VFS: \\otters.example.com crypt_message: Could not get encryption key [440700.386947] ------------[ cut here ]------------ [440700.386948] err = 1 [440700.386977] WARNING: CPU: 11 PID: 2733 at /build/linux-hwe-5.4-p6lk6L/linux-hwe-5.4-5.4.0/lib/errseq.c:74 errseq_set+0x5c/0x70 ... [440700.397304] CPU: 11 PID: 2733 Comm: tar Tainted: G OE 5.4.0-70-generic #78~18.04.1-Ubuntu ... [440700.397334] Call Trace: [440700.397346] __filemap_set_wb_err+0x1a/0x70 [440700.397419] cifs_writepages+0x9c7/0xb30 [cifs] [440700.397426] do_writepages+0x4b/0xe0 [440700.397444] __filemap_fdatawrite_range+0xcb/0x100 [440700.397455] filemap_write_and_wait+0x42/0xa0 [440700.397486] cifs_setattr+0x68b/0xf30 [cifs] [440700.397493] notify_change+0x358/0x4a0 [440700.397500] utimes_common+0xe9/0x1c0 [440700.397510] do_utimes+0xc5/0x150 [440700.397520] __x64_sys_utimensat+0x88/0xd0 Fixes: `61cfac6f26` ("CIFS: Fix possible use after free in demultiplex thread") Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Eugene Korenevsky	a637f4ae03	cifs: fix out-of-bound memory access when calling smb3_notify() at mount point If smb3_notify() is called at mount point of CIFS, build_path_from_dentry() returns the pointer to kmalloc-ed memory with terminating zero (this is empty FileName to be passed to SMB2 CREATE request). This pointer is assigned to the `path` variable. Then `path + 1` (to skip first backslash symbol) is passed to cifs_convert_path_to_utf16(). This is incorrect for empty path and causes out-of-bound memory access. Get rid of this "increase by one". cifs_convert_path_to_utf16() already contains the check for leading backslash in the path. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=212693 CC: <stable@vger.kernel.org> # v5.6+ Signed-off-by: Eugene Korenevsky <ekorenevsky@astralinux.ru> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Aurelien Aptel	ccd48ec3d4	smb2: fix use-after-free in smb2_ioctl_query_info() * rqst[1,2,3] is allocated in vars * each rqst->rq_iov is also allocated in vars or using pooled memory SMB2_open_free, SMB2_ioctl_free, SMB2_query_info_free are iterating on each rqst after vars has been freed (use-after-free), and they are freeing the kvec a second time (double-free). How to trigger: * compile with KASAN * mount a share $ smbinfo quota /mnt/foo Segmentation fault $ dmesg ================================================================== BUG: KASAN: use-after-free in SMB2_open_free+0x1c/0xa0 Read of size 8 at addr ffff888007b10c00 by task python3/1200 CPU: 2 PID: 1200 Comm: python3 Not tainted 5.12.0-rc6+ #107 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x93/0xc2 print_address_description.constprop.0+0x18/0x130 ? SMB2_open_free+0x1c/0xa0 ? SMB2_open_free+0x1c/0xa0 kasan_report.cold+0x7f/0x111 ? smb2_ioctl_query_info+0x240/0x990 ? SMB2_open_free+0x1c/0xa0 SMB2_open_free+0x1c/0xa0 smb2_ioctl_query_info+0x2bf/0x990 ? smb2_query_reparse_tag+0x600/0x600 ? cifs_mapchar+0x250/0x250 ? rcu_read_lock_sched_held+0x3f/0x70 ? cifs_strndup_to_utf16+0x12c/0x1c0 ? rwlock_bug.part.0+0x60/0x60 ? rcu_read_lock_sched_held+0x3f/0x70 ? cifs_convert_path_to_utf16+0xf8/0x140 ? smb2_check_message+0x6f0/0x6f0 cifs_ioctl+0xf18/0x16b0 ? smb2_query_reparse_tag+0x600/0x600 ? cifs_readdir+0x1800/0x1800 ? selinux_bprm_creds_for_exec+0x4d0/0x4d0 ? do_user_addr_fault+0x30b/0x950 ? __x64_sys_openat+0xce/0x140 __x64_sys_ioctl+0xb9/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fdcf1f4ba87 Code: b3 66 90 48 8b 05 11 14 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 13 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffef1ce7748 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000c018cf07 RCX: 00007fdcf1f4ba87 RDX: 0000564c467c5590 RSI: 00000000c018cf07 RDI: 0000000000000003 RBP: 00007ffef1ce7770 R08: 00007ffef1ce7420 R09: 00007fdcf0e0562b R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000004018 R13: 0000000000000001 R14: 0000000000000003 R15: 0000564c467c5590 Allocated by task 1200: kasan_save_stack+0x1b/0x40 __kasan_kmalloc+0x7a/0x90 smb2_ioctl_query_info+0x10e/0x990 cifs_ioctl+0xf18/0x16b0 __x64_sys_ioctl+0xb9/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 1200: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xe5/0x110 slab_free_freelist_hook+0x53/0x130 kfree+0xcc/0x320 smb2_ioctl_query_info+0x2ad/0x990 cifs_ioctl+0xf18/0x16b0 __x64_sys_ioctl+0xb9/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff888007b10c00 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 0 bytes inside of 512-byte region [ffff888007b10c00, ffff888007b10e00) The buggy address belongs to the page: page:0000000044e14b75 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7b10 head:0000000044e14b75 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x100000000010200(slab\|head) raw: 0100000000010200 ffffea000015f500 0000000400000004 ffff888001042c80 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888007b10b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888007b10b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888007b10c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888007b10c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888007b10d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Signed-off-by: Aurelien Aptel <aaptel@suse.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Aurelien Aptel	94b0595a8e	cifs: export supported mount options via new mount_params /proc file Can aid in making mount problems easier to diagnose Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Aurelien Aptel	24fedddc95	cifs: log mount errors using cifs_errorf() This makes the errors accessible from userspace via dmesg and the fs_context fd. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Aurelien Aptel	d9a8692277	cifs: add fs_context param to parsing helpers Add fs_context param to parsing helpers to be able to log into it in next patch. Make some helper static as they are not used outside of fs_context.c Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Aurelien Aptel	9d4ac8b630	cifs: make fs_context error logging wrapper This new helper will be used in the fs_context mount option parsing code. It log errors both in: * the fs_context log queue for userspace to read * kernel printk buffer (dmesg, old behaviour) Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Ronnie Sahlberg	7fe6fe95b9	cifs: add FALLOC_FL_INSERT_RANGE support Emulated via server side copy and setsize for SMB3 and later. In the future we could compound this (and/or optionally use DUPLICATE_EXTENTS if supported by the server). Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Ronnie Sahlberg	5476b5dd82	cifs: add support for FALLOC_FL_COLLAPSE_RANGE Emulated for SMB3 and later via server side copy and setsize. Eventually this could be compounded. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Ronnie Sahlberg	f6d2353a50	cifs: check the timestamp for the cached dirent when deciding on revalidate Improves directory metadata caching Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Ronnie Sahlberg	ed8561fa1d	cifs: pass the dentry instead of the inode down to the revalidation check functions Needed for the final patch in the directory caching series Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:24 -05:00
Ronnie Sahlberg	ed20f54a3c	cifs: add a timestamp to track when the lease of the cached dir was taken and clear the timestamp when we receive a lease break. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	6ef4e9cbe1	cifs: add a function to get a cached dir based on its dentry Needed for subsequent patches in the directory caching series. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	5e9c89d43f	cifs: Grab a reference for the dentry of the cached directory during the lifetime of the cache We need to hold both a reference for the root/superblock as well as the directory that we are caching. We need to drop these references before we call kill_anon_sb(). At this point, the root and the cached dentries are always the same but this will change once we start caching other directories as well. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	269f67e1ff	cifs: store a pointer to the root dentry in cifs_sb_info once we have completed mounting the share And use this to only allow to take out a shared handle once the mount has completed and the sb becomes available. This will become important in follow up patches where we will start holding a reference to the directory dentry for the shared handle during the lifetime of the handle. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	45c0f1aabe	cifs: rename the _shroot functions to _cached_dir These functions will eventually be used to cache any directory, not just the root so change the names. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	e6eb19504e	cifs: pass a path to open_shroot and check if it is the root or not Move the check for the directory path into the open_shroot() function but still fail for any non-root directories. This is preparation for later when we will start using the cache also for other directories than the root. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Ronnie Sahlberg	4df3d976dd	cifs: move the check for nohandlecache into open_shroot instead of doing it in the callsites for open_shroot. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	991e72eb0e	cifs: switch build_path_from_dentry() to using dentry_path_raw() The cost is that we might need to flip '/' to '\\' in more than just the prefix. Needs profiling, but I suspect that we won't get slowdown on that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	f6a9bc336b	cifs: allocate buffer in the caller of build_path_from_dentry() build_path_from_dentry() open-codes dentry_path_raw(). The reason we can't use dentry_path_raw() in there (and postprocess the result as needed) is that the callers of build_path_from_dentry() expect that the object to be freed on cleanup and the string to be used are at the same address. That's painful, since the path is naturally built end-to-beginning - we start at the leaf and go through the ancestors, accumulating the pathname. Life would be easier if we left the buffer allocation to callers. It wouldn't be exact-sized buffer, but none of the callers keep the result for long - it's always freed before the caller returns. So there's no need to do exact-sized allocation; better use __getname()/__putname(), same as we do for pathname arguments of syscalls. What's more, there's no need to do allocation under spinlocks, so GFP_ATOMIC is not needed. Next patch will replace the open-coded dentry_path_raw() (in build_path_from_dentry_optional_prefix()) with calling the real thing. This patch only introduces wrappers for allocating/freeing the buffers and switches to new calling conventions: build_path_from_dentry(dentry, buf) expects buf to be address of a page-sized object or NULL, return value is a pathname built inside that buffer on success, ERR_PTR(-ENOMEM) if buf is NULL and ERR_PTR(-ENAMETOOLONG) if the pathname won't fit into page. Note that we don't need to check for failure when allocating the buffer in the caller - build_path_from_dentry() will do the right thing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	8e33cf20ce	cifs: make build_path_from_dentry() return const char * ... and adjust the callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	f6f1f17907	cifs: constify pathname arguments in a bunch of helpers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	558691393a	cifs: constify path argument of ->make_node() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	9cfdb1c12b	cifs: constify get_normalized_path() properly As it is, it takes const char * and, in some cases, stores it in caller's variable that is plain char *. Fortunately, none of the callers actually proceeded to modify the string via now-non-const alias, but that's trouble waiting to happen. It's easy to do properly, anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Al Viro	8d76722355	cifs: don't cargo-cult strndup() strndup(s, strlen(s)) is a highly unidiomatic way to spell strdup(s); it's NOT safer in any way, since strlen() is just as sensitive to NUL-termination as strdup() is. strndup() is for situations when you need a copy of a known-sized substring, not a magic security juju to drive the bad spirits away. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Steve French	b9335f6210	SMB3: update structures for new compression protocol definitions Protocol has been extended for additional compression headers. See MS-SMB2 section 2.2.42 Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:23 -05:00
Aurelien Aptel	ec4e4862a9	cifs: remove old dead code While reviewing a patch clarifying locks and locking hierarchy I realized some locks were unused. This commit removes old data and code that isn't actually used anywhere, or hidden in ifdefs which cannot be enabled from the kernel config. * The uid/gid trees and associated locks are left-overs from when uid/sid mapping had an extra caching layer on top of the keyring and are now unused. See commit `faa65f07d2` ("cifs: simplify id_to_sid and sid_to_id mapping code") from 2012. * cifs_oplock_break_ops is a left-over from when slow_work was remplaced by regular workqueue and is now unused. See commit `9b64697246` ("cifs: use workqueue instead of slow-work") from 2010. * CIFSSMBSetAttrLegacy is SMB1 cruft dealing with some legacy NT4/Win9x behaviour. * Remove CONFIG_CIFS_DNOTIFY_EXPERIMENTAL left-overs. This was already partially removed in `392e1c5dc9` ("cifs: rename and clarify CIFS_ASYNC_OP and CIFS_NO_RESP") from 2019. Kill it completely. * Another candidate that was considered but spared is CONFIG_CIFS_NFSD_EXPORT which has an empty implementation and cannot be enabled by a config option (although it is listed but disabled with "BROKEN" as a dep). It's unclear whether this could even function today in its current form but it has it's own .c file and Kconfig entry which is a bit more involved to remove and might make a come back? Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Gustavo A. R. Silva	9f4c6eed26	cifs: cifspdu.h: Replace one-element array with flexible-array member There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Also, this helps with the ongoing efforts to enable -Warray-bounds by fixing the following warning: CC [M] fs/cifs/cifssmb.o fs/cifs/cifssmb.c: In function ‘CIFSFindNext’: fs/cifs/cifssmb.c:4636:23: warning: array subscript 1 is above array bounds of ‘char[1]’ [-Warray-bounds] 4636 \| pSMB->ResumeFileName[name_len+1] = 0; \| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~ [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/109 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Wan Jiabing	5e14c7240a	fs: cifs: Remove repeated struct declaration struct cifs_writedata is declared twice. One is declared at 209th line. And struct cifs_writedata is defined blew. The declaration hear is not needed. Remove the duplicate. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Aurelien Aptel	b7fd0fa0ea	cifs: simplify SWN code with dummy funcs instead of ifdefs This commit doesn't change the logic of SWN. Add dummy implementation of SWN functions when SWN is disabled instead of using ifdef sections. The dummy functions get optimized out, this leads to clearer code and compile time type-checking regardless of config options with no runtime penalty. Leave the simple ifdefs section as-is. A single bitfield (bool foo:1) on its own will use up one int. Move tcon->use_witness out of ifdefs with the other tcon bitfields. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Steve French	bb9cad1b49	smb3: update protocol header definitions based to include new flags [MS-SMB2] protocol specification was recently updated to include new flags, new negotiate context and some minor changes to fields. Update smb2pdu.h structure definitions to match the newest version of the protocol specification. Updates to the compression context values will be in a followon patch. Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Steve French	edc9dd1e3c	cifs: correct comments explaining internal semaphore usage in the module A few of the semaphores had been removed, and one additional one needed to be noted in the comments. Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Jiapeng Chong	83cd9ed7ae	cifs: Remove useless variable Fix the following gcc warning: fs/cifs/cifsacl.c:1097:8: warning: variable ‘nmode’ set but not used [-Wunused-but-set-variable]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
jack1.li_cp	c45adff786	cifs: Fix spelling of 'security' secuirty -> security Signed-off-by: jack1.li_cp <liliu1@yulong.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2021-04-25 16:28:22 -05:00
Stefan Metzmacher	a2a7cc32a5	io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker This is done by create_io_thread() now. Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:29:05 -06:00
Hao Xu	2b4ae19c6d	io_uring: update sq_thread_idle after ctx deleted we shall update sq_thread_idle anytime we do ctx deletion from ctx_list Fixes:734551df6f9b ("io_uring: fix shared sqpoll cancellation hangs") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1619256380-236460-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:25 -06:00
Pavel Begunkov	634d00df5e	io_uring: add full-fledged dynamic buffers support Hook buffers into all rsrc infrastructure, including tagging and updates. Suggested-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/119ed51d68a491dae87eb55fb467a47870c86aad.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Bijan Mottahedeh	bd54b6fe33	io_uring: implement fixed buffers registration similar to fixed files Apply fixed_rsrc functionality for fixed buffers support. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [rebase, remove multi-level tables, fix unregister on exit] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/17035f4f75319dc92962fce4fc04bc0afb5a68dc.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	eae071c9b4	io_uring: prepare fixed rw for dynanic buffers With dynamic buffer updates, registered buffers in the table may change at any moment. First of all we want to prevent future races between updating and importing (i.e. io_import_fixed()), where the latter one may happen without uring_lock held, e.g. from io-wq. Save the first loaded io_mapped_ubuf buffer and reuse. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/21a2302d07766ae956640b6f753292c45200fe8f.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	41edf1a5ec	io_uring: keep table of pointers to ubufs Instead of keeping a table of ubufs convert them into pointers to ubuf, so we can atomically read one pointer and be sure that the content of ubuf won't change. Because it was already dynamically allocating imu->bvec, throw both imu and bvec into a single structure so they can be allocated together. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b96efa4c5febadeccf41d0e849ac099f4c83b0d3.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	c3bdad0271	io_uring: add generic rsrc update with tags Add IORING_REGISTER_RSRC_UPDATE, which also supports passing in rsrc tags. Implement it for registered files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4dc66df204212f64835ffca2c4eb5e8363f2f05.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	792e35824b	io_uring: add IORING_REGISTER_RSRC Add a new io_uring_register() opcode for rsrc registeration. Instead of accepting a pointer to resources, fds or iovecs, it @arg is now pointing to a struct io_uring_rsrc_register, and the second argument tells how large that struct is to make it easily extendible by adding new fields. All that is done mainly to be able to pass in a pointer with tags. Pass it in and enable CQE posting for file resources. Doesn't support setting tags on update yet. A design choice made here is to not post CQEs on rsrc de-registration, but only when we updated-removed it by rsrc dynamic update. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c498aaec32a4bb277b2406b9069662c02cdda98c.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	fdecb66281	io_uring: enumerate dynamic resources As resources are getting more support and common parts, it'll be more convenient to index resources and use it for indexing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0be63e9310212d5601d36277c2946ff7a040485.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	98f0b3b4f1	io_uring: add generic path for rsrc update Extract some common parts for rsrc update, will be used reg buffers support dynamic (i.e. quiesce-lee) managing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b49c3ff6b9ff0e530295767604fe4de64d349e04.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	b60c8dce33	io_uring: preparation for rsrc tagging We need a way to notify userspace when a lazily removed resource actually died out. This will be done by associating a tag, which is u64 exactly like req->user_data, with each rsrc (e.g. buffer of file). A CQE will be posted once a resource is actually put down. Tag 0 is a special value set by default, for whcih it don't generate an CQE, so providing the old behaviour. Don't expose it to the userspace yet, but prepare internally, allocate buffers, add all posting hooks, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e6beec5eabe7216bb61fb93cdf5aaf65812a9b0.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	d4d19c19d6	io_uring: decouple CQE filling from requests Make __io_cqring_fill_event() agnostic of struct io_kiocb, pass all the data needed directly into it. Will be used to post rsrc removal completions, which don't have an associated request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c9b8da9e42772db2033547dfebe479dc972a0f2c.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	44b31f2fa2	io_uring: return back rsrc data free helper Add io_rsrc_data_free() helper for destroying rsrc_data, easier for search and the function will get more stuff to destroy shortly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/562d1d53b5ff184f15b8949a63d76ef19c4ba9ec.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Pavel Begunkov	fff4db76be	io_uring: move __io_sqe_files_unregister A preparation patch moving __io_sqe_files_unregister() definition closer to other "files" functions without any modification. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95caf17fe837e67bd1f878395f07049062a010d4.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-25 10:14:04 -06:00
Chao Yu	2e22d48dca	f2fs: clean up left deprecated IO trace codes Commit `d5f7bc0064` ("f2fs: deprecate f2fs_trace_io") left some dead codes, delete them. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-04-24 17:41:50 -07:00
Masahiro Yamada	d8fc9b667d	sysctl: use min() helper for namecmp() Make it slightly readable by using min(). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Kees Cook <keescook@chromium.org>	2021-04-25 05:25:16 +09:00
Christian König	2896900e22	ovl: fix reference counting in ovl_mmap error path mmap_region() now calls fput() on the vma->vm_file. Fix this by using vma_set_file() so it doesn't need to be handled manually here any more. Link: https://lkml.kernel.org/r/20210421132012.82354-2-christian.koenig@amd.com Fixes: `1527f926fd` ("mm: mmap: fix fput in error path v2") Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: <stable@vger.kernel.org> [5.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-04-23 14:42:39 -07:00
Christian König	9da29c7f77	coda: fix reference counting in coda_file_mmap error path mmap_region() now calls fput() on the vma->vm_file. So we need to drop the extra reference on the coda file instead of the host file. Link: https://lkml.kernel.org/r/20210421132012.82354-1-christian.koenig@amd.com Fixes: `1527f926fd` ("mm: mmap: fix fput in error path v2") Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: <stable@vger.kernel.org> [5.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-04-23 14:42:39 -07:00
Hao Xu	724cb4f9ec	io_uring: check sqring and iopoll_list before shedule do this to avoid race below: userspace kernel \| check sqring and iopoll_list submit sqe \| check IORING_SQ_NEED_WAKEUP \| (which is not set) \| \| \| set IORING_SQ_NEED_WAKEUP wait cqe \| schedule(never wakeup again) Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/1619018351-75883-1-git-send-email-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-23 08:26:41 -06:00
David Howells	3003bbd069	afs: Use the netfs_write_begin() helper Make AFS use the new netfs_write_begin() helper to do the pre-reading required before the write. If successful, the helper returns with the required page filled in and locked. It may read more than just one page, expanding the read to meet cache granularity requirements as necessary. Note: A more advanced version of this could be made that does generic_perform_write() for a whole cache granule. This would make it easier to avoid doing the download/read for the data to be overwritten. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588546422.3465195.1546354372589291098.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161539563244.286939.16537296241609909980.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653819291.2770958.406013201547420544.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789102743.6155.17396591236631761195.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:28 +01:00
David Howells	5cbf03985c	afs: Use new netfs lib read helper API Make AFS use the new netfs read helpers to implement the VM read operations: - afs_readpage() now hands off responsibility to netfs_readpage(). - afs_readpages() is gone and replaced with afs_readahead(). - afs_readahead() just hands off responsibility to netfs_readahead(). These make use of the cache if a cookie is supplied, otherwise just call the ->issue_op() method a sufficient number of times to complete the entire request. Changes: v5: - Use proper wait function for PG_fscache in afs_page_mkwrite()[1]. - Use killable wait for PG_writeback in afs_page_mkwrite()[1]. v4: - Folded in error handling fixes to afs_req_issue_op(). - Added flag to netfs_subreq_terminated() to indicate that the caller may have been running async and stuff that might sleep needs punting to a workqueue. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk [1] Link: https://lore.kernel.org/r/160588542733.3465195.7526541422073350302.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118158436.1232039.3884845981224091996.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161053540.2537118.14904446369309535330.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340418739.1303470.5908092911600241280.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539561926.286939.5729036262354802339.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653817977.2770958.17696456811587237197.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789101258.6155.3879271028895121537.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:28 +01:00
David Howells	dc4191841d	afs: Use the fs operation ops to handle FetchData completion Use the 'success' and 'aborted' afs_operations_ops methods and add a 'failed' method to handle the completion of an AFS.FetchData, AFS.FetchData64 or YFS.FetchData64 RPC operation rather than directly calling the done func pointed to by the afs_read struct from the call delivery handler. This means the done function will be called back on error also, not just on successful completion. This allows motion towards asynchronous data reception on data fetch calls and allows any error to be handed off to the fscache read helper in the same place as a successful completion. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588541471.3465195.8807019223378490810.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118157260.1232039.6549085372718234792.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161052647.2537118.12922380836599003659.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340417106.1303470.3502017303898569631.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539560673.286939.391310781674212229.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653816367.2770958.5856904574822446404.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789099994.6155.473719823490561190.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:28 +01:00
David Howells	e87b03f583	afs: Prepare for use of THPs As a prelude to supporting transparent huge pages, use thp_size() and similar rather than PAGE_SIZE/SHIFT. Further, try and frame everything in terms of file positions and lengths rather than page indices and numbers of pages. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588540227.3465195.4752143929716269062.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118155821.1232039.540445038028845740.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161051439.2537118.15577827510426326534.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340415869.1303470.6040191748634322355.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539559365.286939.18344613540296085269.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653815142.2770958.454490670311230206.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789098713.6155.16394227991842480300.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:27 +01:00
David Howells	810caa3e67	afs: Extract writeback extension into its own function Extract writeback extension into its own function to break up the writeback function a bit. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588538471.3465195.782513375683399583.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118154610.1232039.1765365632920504822.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161050546.2537118.2202554806419189453.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340414102.1303470.9078891484034668985.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539558417.286939.2879469588895925399.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653813972.2770958.12671731209438112378.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789097132.6155.4916609419912731964.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:27 +01:00
David Howells	630f5dda84	afs: Wait on PG_fscache before modifying/releasing a page PG_fscache is going to be used to indicate that a page is being written to the cache, and that the page should not be modified or released until it's finished. Make afs_invalidatepage() and afs_releasepage() wait for it. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/158861253957.340223.7465334678444521655.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465832417.1377938.3571599385208729791.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588536286.3465195.13231895135369807920.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118153708.1232039.3535103645871176749.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161049369.2537118.11591934943429117060.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340412903.1303470.6424701655031380012.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539556890.286939.5873470593519458598.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653812726.2770958.18167145829938766503.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789096241.6155.5907241930823579235.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:27 +01:00
David Howells	bd80d8a80e	afs: Use ITER_XARRAY for writing Use a single ITER_XARRAY iterator to describe the portion of a file to be transmitted to the server rather than generating a series of small ITER_BVEC iterators on the fly. This will make it easier to implement AIO in afs. In theory we could maybe use one giant ITER_BVEC, but that means potentially allocating a huge array of bio_vec structs (max 256 per page) when in fact the pagecache already has a structure listing all the relevant pages (radix_tree/xarray) that can be walked over. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/153685395197.14766.16289516750731233933.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/158861251312.340223.17924900795425422532.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465828607.1377938.6903132788463419368.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588535018.3465195.14509994354240338307.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118152415.1232039.6452879415814850025.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161048194.2537118.13763612220937637316.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340411602.1303470.4661108879482218408.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539555629.286939.5241869986617154517.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653811456.2770958.7017388543246759245.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789095005.6155.6789055030327407928.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:27 +01:00
David Howells	c450846461	afs: Set up the iov_iter before calling afs_extract_data() afs_extract_data() sets up a temporary iov_iter and passes it to AF_RXRPC each time it is called to describe the remaining buffer to be filled. Instead: (1) Put an iterator in the afs_call struct. (2) Set the iterator for each marshalling stage to load data into the appropriate places. A number of convenience functions are provided to this end (eg. afs_extract_to_buf()). This iterator is then passed to afs_extract_data(). (3) Use the new ITER_XARRAY iterator when reading data to load directly into the inode's pages without needing to create a list of them. This will allow O_DIRECT calls to be supported in future patches. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/152898380012.11616.12094591785228251717.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/153685394431.14766.3178466345696987059.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/153999787395.866.11218209749223643998.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/154033911195.12041.3882700371848894587.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/158861250059.340223.1248231474865140653.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465827399.1377938.11181327349704960046.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588533776.3465195.3612752083351956948.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118151238.1232039.17015723405750601161.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161047240.2537118.14721975104810564022.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340410333.1303470.16260122230371140878.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539554187.286939.15305559004905459852.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653810525.2770958.4630666029125411789.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789093719.6155.7877160739235087723.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:27 +01:00
David Howells	05092755aa	afs: Log remote unmarshalling errors Log unmarshalling errors reported by the peer (ie. it can't parse what we sent it). Limit the maximum number of messages to 3. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/159465826250.1377938.16372395422217583913.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588532584.3465195.15618385466614028590.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118149739.1232039.208060911149801695.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161046033.2537118.7779717661044373273.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340409118.1303470.17812607349396199116.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539552964.286939.16503232687974398308.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653808989.2770958.11530765353025697860.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789092349.6155.8581594259882708631.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:26 +01:00
David Howells	f105da1a79	afs: Don't truncate iter during data fetch Don't truncate the iterator to correspond to the actual data size when fetching the data from the server - rather, pass the length we want to read to rxrpc. This will allow the clear-after-read code in future to simply clear the remaining iterator capacity rather than having to reinitialise the iterator. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/158861249201.340223.13035445866976590375.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465825061.1377938.14403904452300909320.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588531418.3465195.10712005940763063144.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118148567.1232039.13380313332292947956.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161044610.2537118.17908520793806837792.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340407907.1303470.6501394859511712746.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539551721.286939.14655713136572200716.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653807790.2770958.14034599989374173734.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789090823.6155.15673999934535049102.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:26 +01:00
David Howells	c69bf479ba	afs: Move key to afs_read struct Stash the key used to authenticate read operations in the afs_read struct. This will be necessary to reissue the operation against the server if a read from the cache fails in upcoming cache changes. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/158861248336.340223.1851189950710196001.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465823899.1377938.11925978022348532049.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588529557.3465195.7303323479305254243.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118147693.1232039.13780672951838643842.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161043340.2537118.511899217704140722.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340406678.1303470.12676824086429446370.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539550819.286939.1268332875889175195.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653806683.2770958.11300984379283401542.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789089556.6155.14603302893431820997.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:26 +01:00
David Howells	f015cf1d6b	afs: Print the operation debug_id when logging an unexpected data version Print the afs_operation debug_id when logging an unexpected change in the data version. This allows the logged message to be matched against tracelines. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588528377.3465195.2206051235095182302.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118146111.1232039.11398082422487058312.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161042180.2537118.2471333561661033316.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340405772.1303470.3877167548944248214.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539549628.286939.15234870409714613954.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653805530.2770958.15120507632529970934.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789088290.6155.3494369629853673866.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:26 +01:00
David Howells	67d78a6f6e	afs: Pass page into dirty region helpers to provide THP size Pass a pointer to the page being accessed into the dirty region helpers so that the size of the page can be determined in case it's a transparent huge page. This also required the page to be passed into the afs_page_dirty trace point - so there's no need to specifically pass in the index or private data as these can be retrieved directly from the page struct. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588527183.3465195.16107942526481976308.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118144921.1232039.11377711180492625929.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161040747.2537118.11435394902674511430.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340404553.1303470.11414163641767769882.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539548385.286939.8864598314493255313.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653804285.2770958.3497360004849598038.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789087043.6155.16922142208140170528.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:26 +01:00
David Howells	03ffae9092	afs: Disable use of the fscache I/O routines Disable use of the fscache I/O routined by the AFS filesystem. It's about to transition to passing iov_iters down and fscache is about to have its I/O path to use iov_iter, so all that needs to change. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/158861209824.340223.1864211542341758994.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465768717.1376105.2229314852486665807.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588457929.3465195.1730097418904945578.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118143744.1232039.2727898205333669064.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161039077.2537118.7986870854927176905.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340403323.1303470.8159439948319423431.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539547167.286939.3536238932531122332.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653802797.2770958.547311814861545911.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789085806.6155.2596146255056027428.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:17:25 +01:00
David Howells	26aaeffcaf	fscache, cachefiles: Add alternate API to use kiocb for read/write to cache Add an alternate API by which the cache can be accessed through a kiocb, doing async DIO, rather than using the current API that tells the cache where all the pages are. The new API is intended to be used in conjunction with the netfs helper library. A filesystem must pick one or the other and not mix them. Filesystems wanting to use the new API must #define FSCACHE_USE_NEW_IO_API before #including the header. This prevents them from continuing to use the old API at the same time as there are incompatibilities in how the PG_fscache page bit is used. Changes: v6: - Provide a routine to shape a write so that the start and length can be aligned for DIO[3]. v4: - Use the vfs_iocb_iter_read/write() helpers[1] - Move initial definition of fscache_begin_read_operation() here. - Remove a commented-out line[2] - Combine ki->term_func calls in cachefiles_read_complete()[2]. - Remove explicit NULL initialiser[2]. - Remove extern on func decl[2]. - Put in param names on func decl[2]. - Remove redundant else[2]. - Fill out the kdoc comment for fscache_begin_read_operation(). - Rename fs/fscache/page2.c to io.c to match later patches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Christoph Hellwig <hch@lst.de> cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210216102614.GA27555@lst.de/ [1] Link: https://lore.kernel.org/r/20210216084230.GA23669@lst.de/ [2] Link: https://lore.kernel.org/r/161781047695.463527.7463536103593997492.stgit@warthog.procyon.org.uk/ [3] Link: https://lore.kernel.org/r/161118142558.1232039.17993829899588971439.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161037850.2537118.8819808229350326503.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340402057.1303470.8038373593844486698.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539545919.286939.14573472672781434757.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653801477.2770958.10543270629064934227.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789084517.6155.12799689829859169640.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	0246f3e573	netfs: Add a tracepoint to log failures that would be otherwise unseen Add a tracepoint to log internal failures (such as cache errors) that we don't otherwise want to pass back to the netfs. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/161781048813.463527.1557000804674707986.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/161789082749.6155.15498680577213140870.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	726218fdc2	netfs: Define an interface to talk to a cache Add an interface to the netfs helper library for reading data from the cache instead of downloading it from the server and support for writing data just downloaded or cleared to the cache. The API passes an iov_iter to the cache read/write routines to indicate the data/buffer to be used. This is done using the ITER_XARRAY type to provide direct access to the netfs inode's pagecache. When the netfs's ->begin_cache_operation() method is called, this must fill in the cache_resources in the netfs_read_request struct, including the netfs_cache_ops used by the helper lib to talk to the cache. The helper lib does not directly access the cache. Changes: v6: - Call trace_netfs_read() after beginning the cache op so that the cookie debug ID can be logged[3]. - Don't record the error from writing to the cache. We don't want to pass it back to the netfs[4]. - Fix copy-to-cache subreq amalgamation to not round up as it goes along otherwise it overcalculates the length of the write[5]. v5: - Use end_page_fscache() rather than unlock_page_fscache()[2]. v4: - Added flag to netfs_subreq_terminated() to indicate that the caller may have been running async and stuff that might sleep needs punting to a workqueue (can't use in_softirq()[1]). - Add missing inc of netfs_n_rh_read stat. - Move initial definition of fscache_begin_read_operation() elsewhere. - Need to call op->begin_cache_operation() from netfs_write_begin(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210216084230.GA23669@lst.de/ [1] Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk/ [2] Link: https://lore.kernel.org/r/161781045123.463527.14533348855710902201.stgit@warthog.procyon.org.uk/ [3] Link: https://lore.kernel.org/r/161781046256.463527.18158681600085556192.stgit@warthog.procyon.org.uk/ [4] Link: https://lore.kernel.org/r/161781047695.463527.7463536103593997492.stgit@warthog.procyon.org.uk/ [5] Link: https://lore.kernel.org/r/161118141321.1232039.8296910406755622458.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161036700.2537118.11170748455436854978.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340399569.1303470.1138884774643385730.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539542874.286939.13337898213448136687.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653799826.2770958.9015430297426331950.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789081462.6155.3853904866933313256.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	e1b1240c1f	netfs: Add write_begin helper Add a helper to do the pre-reading work for the netfs write_begin address space op. Changes v6: - Fixed a missing rreq put in netfs_write_begin()[3]. - Use DEFINE_READAHEAD()[4]. v5: - Made the wait for PG_fscache in netfs_write_begin() killable[2]. v4: - Added flag to netfs_subreq_terminated() to indicate that the caller may have been running async and stuff that might sleep needs punting to a workqueue (can't use in_softirq()[1]). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210216084230.GA23669@lst.de/ [1] Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk/ [2] Link: https://lore.kernel.org/r/161781042127.463527.9154479794406046987.stgit@warthog.procyon.org.uk/ [3] Link: https://lore.kernel.org/r/1234933.1617886271@warthog.procyon.org.uk/ [4] Link: https://lore.kernel.org/r/160588543960.3465195.2792938973035886168.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118140165.1232039.16418853874312234477.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161035539.2537118.15674887534950908530.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340398368.1303470.11242918276563276090.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539541541.286939.1889738674057013729.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653798616.2770958.17213315845968485563.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789080530.6155.1011847312392330491.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	289af54cc6	netfs: Gather stats Gather statistics from the netfs interface that can be exported through a seqfile. This is intended to be called by a later patch when viewing /proc/fs/fscache/stats. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/161118139247.1232039.10556850937548511068.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161034669.2537118.2761232524997091480.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340397101.1303470.17581910581108378458.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539539959.286939.6794352576462965914.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653797700.2770958.5801990354413178228.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789079281.6155.17141344853277186500.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	77b4d2c631	netfs: Add tracepoints Add three tracepoints to track the activity of the read helpers: (1) netfs/netfs_read This logs entry to the read helpers and also expansion of the range in a readahead request. (2) netfs/netfs_rreq This logs the progress of netfs_read_request objects which track read requests. A read request may be a compound of multiple subrequests. (3) netfs/netfs_sreq This logs the progress of netfs_read_subrequest objects, which track the contributions from various sources to a read request. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/161118138060.1232039.5353374588021776217.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161033468.2537118.14021843889844001905.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340395843.1303470.7355519662919639648.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539538693.286939.10171713520419106334.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653796447.2770958.1870655382450862155.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789078003.6155.17814844411672989942.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	3d3c950467	netfs: Provide readahead and readpage netfs helpers Add a pair of helper functions: () netfs_readahead() () netfs_readpage() to do the work of handling a readahead or a readpage, where the page(s) that form part of the request may be split between the local cache, the server or just require clearing, and may be single pages and transparent huge pages. This is all handled within the helper. Note that while both will read from the cache if there is data present, only netfs_readahead() will expand the request beyond what it was asked to do, and only netfs_readahead() will write back to the cache. netfs_readpage(), on the other hand, is synchronous and only fetches the page (which might be a THP) it is asked for. The netfs gives the helper parameters from the VM, the cache cookie it wants to use (or NULL) and a table of operations (only one of which is mandatory): () expand_readahead() [optional] Called to allow the netfs to request an expansion of a readahead request to meet its own alignment requirements. This is done by changing rreq->start and rreq->len. () clamp_length() [optional] Called to allow the netfs to cut down a subrequest to meet its own boundary requirements. If it does this, the helper will generate additional subrequests until the full request is satisfied. () is_still_valid() [optional] Called to find out if the data just read from the cache has been invalidated and must be reread from the server. () issue_op() [required] Called to ask the netfs to issue a read to the server. The subrequest describes the read. The read request holds information about the file being accessed. The netfs can cache information in rreq->netfs_priv. Upon completion, the netfs should set the error, transferred and can also set FSCACHE_SREQ_CLEAR_TAIL and then call fscache_subreq_terminated(). () done() [optional] Called after the pages have been unlocked. The read request is still pinning the file and mapping and may still be pinning pages with PG_fscache. rreq->error indicates any error that has been accumulated. () cleanup() [optional] Called when the helper is disposing of a finished read request. This allows the netfs to clear rreq->netfs_priv. Netfs support is enabled with CONFIG_NETFS_SUPPORT=y. It will be built even if CONFIG_FSCACHE=n and in this case much of it should be optimised away, allowing the filesystem to use it even when caching is disabled. Changes: v5: - Comment why netfs_readahead() is putting pages[2]. - Use page_file_mapping() rather than page->mapping[2]. - Use page_index() rather than page->index[2]. - Use set_page_fscache()[3] rather then SetPageFsCache() as this takes an appropriate ref too[4]. v4: - Folded in a kerneldoc comment fix. - Folded in a fix for the error handling in the case that ENOMEM occurs. - Added flag to netfs_subreq_terminated() to indicate that the caller may have been running async and stuff that might sleep needs punting to a workqueue (can't use in_softirq()[1]). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210216084230.GA23669@lst.de/ [1] Link: https://lore.kernel.org/r/20210321014202.GF3420@casper.infradead.org/ [2] Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk/ [3] Link: https://lore.kernel.org/r/CAHk-=wh+2gbF7XEjYc=HV9w_2uVzVf7vs60BPz0gFA=+pUm3ww@mail.gmail.com/ [4] Link: https://lore.kernel.org/r/160588497406.3465195.18003475695899726222.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118136849.1232039.8923686136144228724.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161032290.2537118.13400578415247339173.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340394873.1303470.6237319335883242536.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539537375.286939.16642940088716990995.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653795430.2770958.4947584573720000554.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789076581.6155.6745849361504760209.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
David Howells	3a5829fefd	netfs: Make a netfs helper module Make a netfs helper module to manage read request segmentation, caching support and transparent huge page support on behalf of a network filesystem. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Jeff Layton <jlayton@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/160588496284.3465195.10102643717770106661.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118135638.1232039.1622182202673126285.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161031028.2537118.1213974428943508753.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340391427.1303470.14884950716721956560.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539531569.286939.18317119181653706665.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653790328.2770958.6710423217716151549.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789071202.6155.16519256513958534906.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 10:14:32 +01:00
Matthew Wilcox (Oracle)	fcd9ae4f7f	mm/filemap: Pass the file_ra_state in the ractl For readahead_expand(), we need to modify the file ra_state, so pass it down by adding it to the ractl. We have to do this because it's not always the same as f_ra in the struct file that is already being passed. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/20210407201857.3582797-2-willy@infradead.org/ Link: https://lore.kernel.org/r/161789067431.6155.8063840447229665720.stgit@warthog.procyon.org.uk/ # v6	2021-04-23 09:25:00 +01:00
Christoph Hellwig	732de7dbdb	xfs: rename struct xfs_legacy_ictimestamp Rename struct xfs_legacy_ictimestamp to struct xfs_log_legacy_timestamp as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-04-22 18:29:25 -07:00
Christoph Hellwig	6fc277c7c9	xfs: rename xfs_ictimestamp_t Rename xfs_ictimestamp_t to xfs_log_timestamp_t as it is a type used for logging timestamps with no relationship to the in-core inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2021-04-22 18:29:17 -07:00
Leah Rumancik	6c09127396	ext4: wipe ext4_dir_entry2 upon file deletion Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len. In case sensitive data is stored in filenames, this ensures no potentially sensitive data is left in the directory entry upon deletion. Also, wipe these fields upon moving a directory entry during the conversion to an htree and when splitting htree nodes. The data wiped may still exist in the journal, but there are future commits planned to address this. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-04-22 16:51:23 -04:00
Jan Kara	5899593f51	ext4: Fix occasional generic/418 failure Eric has noticed that after pagecache read rework, generic/418 is occasionally failing for ext4 when blocksize < pagesize. In fact, the pagecache rework just made hard to hit race in ext4 more likely. The problem is that since ext4 conversion of direct IO writes to iomap framework (commit `378f32bab3`), we update inode size after direct IO write only after invalidating page cache. Thus if buffered read sneaks at unfortunate moment like: CPU1 - write at offset 1k CPU2 - read from offset 0 iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT); ext4_readpage(); ext4_handle_inode_extension() the read will zero out tail of the page as it still sees smaller inode size and thus page cache becomes inconsistent with on-disk contents with all the consequences. Fix the problem by moving inode size update into end_io handler which gets called before the page cache is invalidated. Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com> Fixes: `378f32bab3` ("ext4: introduce direct I/O write using iomap infrastructure") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20210415155417.4734-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2021-04-22 16:51:03 -04:00
Mickaël Salaün	83e804f0bf	fs,security: Add sb_delete hook The sb_delete security hook is called when shutting down a superblock, which may be useful to release kernel objects tied to the superblock's lifetime (e.g. inodes). This new hook is needed by Landlock to release (ephemerally) tagged struct inodes. This comes from the unprivileged nature of Landlock described in the next commit. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jmorris@namei.org> Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210422154123.13086-7-mic@digikod.net Signed-off-by: James Morris <jamorris@linux.microsoft.com>	2021-04-22 12:22:11 -07:00
Ard Biesheuvel	e3a606f2c5	fsverity: relax build time dependency on CRYPTO_SHA256 CONFIG_CRYPTO_SHA256 denotes the generic C implementation of the SHA-256 shash algorithm, which is selected as the default crypto shash provider for fsverity. However, fsverity has no strict link time dependency, and the same shash could be exposed by an optimized implementation, and arm64 has a number of those (scalar, NEON-based and one based on special crypto instructions). In such cases, it makes little sense to require that the generic C implementation is incorporated as well, given that it will never be called. To address this, relax the 'select' clause to 'imply' so that the generic driver can be omitted from the build if desired. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-04-22 17:31:32 +10:00
Ard Biesheuvel	a0fc20333e	fscrypt: relax Kconfig dependencies for crypto API algorithms Even if FS encryption has strict functional dependencies on various crypto algorithms and chaining modes. those dependencies could potentially be satisified by other implementations than the generic ones, and no link time dependency exists on the 'depends on' claused defined by CONFIG_FS_ENCRYPTION_ALGS. So let's relax these clauses to 'imply', so that the default behavior is still to pull in those generic algorithms, but in a way that permits them to be disabled again in Kconfig. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-04-22 17:31:32 +10:00
Chao Yu	509f1010e4	f2fs: avoid using native allocate_segment_by_default() As we did for other cases, in fix_curseg_write_pointer(), let's use wrapped f2fs_allocate_new_section() instead of native allocate_segment_by_default(), by this way, it fixes to cover segment allocation with curseg_lock and sentry_lock. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-04-21 21:00:59 -07:00
Tian Tao	a3cc754ad9	fs/reiserfs/journal.c: delete useless variables The value of 'cn' is not used, so just delete it. Link: https://lore.kernel.org/r/1618278196-17749-1-git-send-email-tiantao6@hisilicon.com Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jan Kara <jack@suse.cz>	2021-04-21 16:29:05 +02:00
Gustavo A. R. Silva	76c50eb70d	nfsd: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding a couple of break statements instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-04-20 16:55:07 -04:00
Gustavo A. R. Silva	e5966cf20f	gfs2: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple goto statements instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2021-04-20 22:38:21 +02:00
Pavel Begunkov	f2a48dd09b	io_uring: refactor io_sq_offload_create() Just a bit of code tossing in io_sq_offload_create(), so it looks a bit better. No functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/939776f90de8d2cdd0414e1baa29c8ec0926b561.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Pavel Begunkov	07db298a1c	io_uring: safer sq_creds putting Put sq_creds as a part of io_ring_ctx_free(), it's easy to miss doing it in io_sq_thread_finish(), especially considering past mistakes related to ring creation failures. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3becb1866467a1de82a97345a0a90d7fb8ff875e.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Pavel Begunkov	3a0a690235	io_uring: move inflight un-tracking into cleanup REQ_F_INFLIGHT deaccounting doesn't do any spinlocking or resource freeing anymore, so it's safe to move it into the normal cleanup flow, i.e. into io_clean_op(), so making it cleaner. Also move io_req_needs_clean() to be first in io_dismantle_req() so it doesn't reload req->flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/90653a3a5de4107e3a00536fa4c2ea5f2c38a4ac.1618916549.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 12:55:28 -06:00
Johannes Thumshirn	18bb8bbf13	btrfs: zoned: automatically reclaim zones When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 20:46:31 +02:00
Johannes Thumshirn	f33720657d	btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 20:30:18 +02:00
Johannes Thumshirn	01e86008aa	btrfs: zoned: reset zones of relocated block groups When relocating a block group the freed up space is not discarded in one big block, but each extent is discarded on its own with -odisard=sync. For a zoned filesystem we need to discard the whole block group at once, so btrfs_discard_extent() will translate the discard into a REQ_OP_ZONE_RESET operation, which then resets the device's zone. Failure to reset the zone is not fatal error. Discussion about the approach and regarding transaction blocking: https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 20:25:16 +02:00
Qu Wenruo	e9306ad4ef	btrfs: more graceful errors/warnings on 32bit systems when reaching limits Btrfs uses internally mapped u64 address space for all its metadata. Due to the page cache limit on 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K page size while 256T for 64K page size. Users can have a filesystem which doesn't have metadata beyond the boundary at mount time, but later balance can cause it to create metadata beyond the boundary. And modification to MM layer is unrealistic just for such minor use case. We can't do more than to prevent mounting such filesystem or warn early when the numbers are still within the limits. To address such problem, this patch will introduce the following checks: - Mount time rejection This will reject any fs which has metadata chunk at or beyond the boundary. - Mount time early warning If there is any metadata chunk beyond 5/8th of the boundary, we do an early warning and hope the end user will see it. - Runtime extent buffer rejection If we're going to allocate an extent buffer at or beyond the boundary, reject such request with EOVERFLOW. This is definitely going to cause problems like transaction abort, but we have no better ways. - Runtime extent buffer early warning If an extent buffer beyond 5/8th of the max file size is allocated, do an early warning. Above error/warning message will only be printed once for each fs to reduce dmesg flood. If the mount is rejected, the filesystem will be mountable only on a 64bit host. Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/ Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 19:56:50 +02:00
Filipe Manana	0dc16ef4f6	btrfs: zoned: fix unpaired block group unfreeze during device replace When doing a device replace on a zoned filesystem, if we find a block group with ->to_copy == 0, we jump to the label 'done', which will result in later calling btrfs_unfreeze_block_group(), even though at this point we never called btrfs_freeze_block_group(). Since at this point we have neither turned the block group to RO mode nor made any progress, we don't need to jump to the label 'done'. So fix this by jumping instead to the label 'skip' and dropping our reference on the block group before the jump. Fixes: `78ce9fc269` ("btrfs: zoned: mark block groups to copy for device-replace") CC: stable@vger.kernel.org # 5.12 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 19:32:43 +02:00
Filipe Manana	f9690f426b	btrfs: fix race when picking most recent mod log operation for an old root Commit `dbcc7d57bf` ("btrfs: fix race when cloning extent buffer during rewind of an old root"), fixed a race when we need to rewind the extent buffer of an old root. It was caused by picking a new mod log operation for the extent buffer while getting a cloned extent buffer with an outdated number of items (off by -1), because we cloned the extent buffer without locking it first. However there is still another similar race, but in the opposite direction. The cloned extent buffer has a number of items that does not match the number of tree mod log operations that are going to be replayed. This is because right after we got the last (most recent) tree mod log operation to replay and before locking and cloning the extent buffer, another task adds a new pointer to the extent buffer, which results in adding a new tree mod log operation and incrementing the number of items in the extent buffer. So after cloning we have mismatch between the number of items in the extent buffer and the number of mod log operations we are going to apply to it. This results in hitting a BUG_ON() that produces the following stack trace: ------------[ cut here ]------------ kernel BUG at fs/btrfs/tree-mod-log.c:675! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0 Code: 05 48 8d 74 10 (...) RSP: 0018:ffffc90001027090 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6 RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417 R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0 Call Trace: btrfs_get_old_root+0x16a/0x5c0 ? lock_downgrade+0x400/0x400 btrfs_search_old_slot+0x192/0x520 ? btrfs_search_slot+0x1090/0x1090 ? free_extent_buffer.part.61+0xd7/0x140 ? free_extent_buffer+0x13/0x20 resolve_indirect_refs+0x3e9/0xfc0 ? lock_downgrade+0x400/0x400 ? __kasan_check_read+0x11/0x20 ? add_prelim_ref.part.11+0x150/0x150 ? lock_downgrade+0x400/0x400 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x620 ? __kasan_check_write+0x14/0x20 ? do_raw_spin_unlock+0xa8/0x140 ? rb_insert_color+0x340/0x360 ? prelim_ref_insert+0x12d/0x430 find_parent_nodes+0x5c3/0x1830 ? stack_trace_save+0x87/0xb0 ? resolve_indirect_refs+0xfc0/0xfc0 ? fs_reclaim_acquire+0x67/0xf0 ? __kasan_check_read+0x11/0x20 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? fs_reclaim_acquire+0x67/0xf0 ? __kasan_check_read+0x11/0x20 ? ___might_sleep+0x10f/0x1e0 ? __kasan_kmalloc+0x9d/0xd0 ? trace_hardirqs_on+0x55/0x120 btrfs_find_all_roots_safe+0x142/0x1e0 ? find_parent_nodes+0x1830/0x1830 ? trace_hardirqs_on+0x55/0x120 ? ulist_free+0x1f/0x30 ? btrfs_inode_flags_to_xflags+0x50/0x50 iterate_extent_inodes+0x20e/0x580 ? tree_backref_for_extent+0x230/0x230 ? release_extent_buffer+0x225/0x280 ? read_extent_buffer+0xdd/0x110 ? lock_downgrade+0x400/0x400 ? __kasan_check_read+0x11/0x20 ? lock_acquired+0xbb/0x620 ? __kasan_check_write+0x14/0x20 ? do_raw_spin_unlock+0xa8/0x140 ? _raw_spin_unlock+0x22/0x30 ? release_extent_buffer+0x225/0x280 iterate_inodes_from_logical+0x129/0x170 ? iterate_inodes_from_logical+0x129/0x170 ? btrfs_inode_flags_to_xflags+0x50/0x50 ? iterate_extent_inodes+0x580/0x580 ? __vmalloc_node+0x92/0xb0 ? init_data_container+0x34/0xb0 ? init_data_container+0x34/0xb0 ? kvmalloc_node+0x60/0x80 btrfs_ioctl_logical_to_ino+0x158/0x230 btrfs_ioctl+0x2038/0x4360 ? __kasan_check_write+0x14/0x20 ? mmput+0x3b/0x220 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? __kasan_check_read+0x11/0x20 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x650 ? __might_fault+0x64/0xd0 ? __kasan_check_read+0x11/0x20 ? lock_downgrade+0x400/0x400 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? lockdep_hardirqs_on_prepare+0x13/0x210 ? _raw_spin_unlock_irqrestore+0x51/0x63 ? __kasan_check_read+0x11/0x20 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? lock_downgrade+0x400/0x400 ? lockdep_hardirqs_on_prepare+0x210/0x210 ? __kasan_check_read+0x11/0x20 ? lock_release+0xc8/0x650 ? __task_pid_nr_ns+0xd3/0x250 ? __kasan_check_read+0x11/0x20 ? __fget_files+0x160/0x230 ? __fget_light+0xf2/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f29ae85b427 Code: 00 00 90 48 8b (...) RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427 RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003 RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120 R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003 R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8 Modules linked in: ---[ end trace 85e5fce078dfbe04 ]--- (gdb) l (tree_mod_log_rewind+0x3b1) 0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675). 670 the modification. As we're going backwards, we do the 671 * opposite of each operation here. 672 */ 673 switch (tm->op) { 674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING: 675 BUG_ON(tm->slot < n); 676 fallthrough; 677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING: 678 case BTRFS_MOD_LOG_KEY_REMOVE: 679 btrfs_set_node_key(eb, &tm->key, tm->slot); (gdb) quit The following steps explain in more detail how it happens: 1) We have one tree mod log user (through fiemap or the logical ino ioctl), with a sequence number of 1, so we have fs_info->tree_mod_seq == 1. This is task A; 2) Another task is at ctree.c:balance_level() and we have eb X currently as the root of the tree, and we promote its single child, eb Y, as the new root. Then, at ctree.c:balance_level(), we call: ret = btrfs_tree_mod_log_insert_root(root->node, child, true); 3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field pointing to ebX->start. We only have one item in eb X, so we create only one tree mod log operation, and store in the "tm_list" array; 4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set to the level of eb X and ->generation set to the generation of eb X; 5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with "tm_list" as argument. After that, tree_mod_log_free_eb() calls tree_mod_log_insert(). This inserts the mod log operation of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree with a sequence number of 2 (and fs_info->tree_mod_seq set to 2); 6) Then, after inserting the "tm_list" single element into the tree mod log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which gets the sequence number 3 (and fs_info->tree_mod_seq set to 3); 7) Back to ctree.c:balance_level(), we free eb X by calling btrfs_free_tree_block() on it. Because eb X was created in the current transaction, has no other references and writeback did not happen for it, we add it back to the free space cache/tree; 8) Later some other task B allocates the metadata extent from eb X, since it is marked as free space in the space cache/tree, and uses it as a node for some other btree; 9) The tree mod log user task calls btrfs_search_old_slot(), which calls btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root() with time_seq == 1 and eb_root == eb Y; 10) The first iteration of the while loop finds the tree mod log element with sequence number 3, for the logical address of eb Y and of type BTRFS_MOD_LOG_ROOT_REPLACE; 11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't break out of the loop, and set root_logical to point to tm->old_root.logical, which corresponds to the logical address of eb X; 12) On the next iteration of the while loop, the call to tree_mod_log_search_oldest() returns the smallest tree mod log element for the logical address of eb X, which has a sequence number of 2, an operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to the old slot 0 of eb X (eb X had only 1 item in it before being freed at step 7); 13) We then break out of the while loop and return the tree mod log operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot 0 of eb X, to btrfs_get_old_root(); 14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE operation and set "logical" to the logical address of eb X, which was the old root. We then call tree_mod_log_search() passing it the logical address of eb X and time_seq == 1; 15) But before calling tree_mod_log_search(), task B locks eb X, adds a key to eb X, which results in adding a tree mod log operation of type BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod log, and increments the number of items in eb X from 0 to 1. Now fs_info->tree_mod_seq has a value of 4; 16) Task A then calls tree_mod_log_search(), which returns the most recent tree mod log operation for eb X, which is the one just added by task B at the previous step, with a sequence number of 4, a type of BTRFS_MOD_LOG_KEY_ADD and for slot 0; 17) Before task A locks and clones eb X, task A adds another key to eb X, which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation, with a sequence number of 5, for slot 1 of eb X, increments the number of items in eb X from 1 to 2, and unlocks eb X. Now fs_info->tree_mod_seq has a value of 5; 18) Task A then locks eb X and clones it. The clone has a value of 2 for the number of items and the pointer "tm" points to the tree mod log operation with sequence number 4, not the most recent one with a sequence number of 5, so there is mismatch between the number of mod log operations that are going to be applied to the cloned version of eb X and the number of items in the clone; 19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the tree mod log operation with sequence number 4 and a type of BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1; 20) At tree_mod_log_rewind(), we set the local variable "n" with a value of 2, which is the number of items in the clone of eb X. Then in the first iteration of the while loop, we process the mod log operation with sequence number 4, which is targeted at slot 0 and has a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from 2 to 1. Then we pick the next tree mod log operation for eb X, which is the tree mod log operation with a sequence number of 2, a type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one added in step 5 to the tree mod log tree. We go back to the top of the loop to process this mod log operation, and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON: (...) switch (tm->op) { case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING: BUG_ON(tm->slot < n); fallthrough; (...) Fix this by checking for a more recent tree mod log operation after locking and cloning the extent buffer of the old root node, and use it as the first operation to apply to the cloned extent buffer when rewinding it. Stable backport notes: due to moved code and renames, in =< 5.11 the change should be applied to ctree.c:get_old_root. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/ Fixes: `834328a849` ("Btrfs: tree mod log's old roots could still be part of the tree") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 19:27:17 +02:00
Jens Axboe	eb37267229	io-wq: remove unused io_wqe_need_worker() function A previous commit removed the need for this, but overlooked that we no longer use it at all. Get rid of it. Fixes: `685fe7feed` ("io-wq: eliminate the need for a manager thread") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-20 11:24:22 -06:00
Filipe Manana	67addf2900	btrfs: fix metadata extent leak after failure to create subvolume When creating a subvolume we allocate an extent buffer for its root node after starting a transaction. We setup a root item for the subvolume that points to that extent buffer and then attempt to insert the root item into the root tree - however if that fails, due to ENOMEM for example, we do not free the extent buffer previously allocated and we do not abort the transaction (as at that point we did nothing that can not be undone). This means that we effectively do not return the metadata extent back to the free space cache/tree and we leave a delayed reference for it which causes a metadata extent item to be added to the extent tree, in the next transaction commit, without having backreferences. When this happens 'btrfs check' reports the following: $ btrfs check /dev/sdi Opening filesystem to check... Checking filesystem on /dev/sdi UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb [1/7] checking root items [2/7] checking extents ref mismatch on [30425088 16384] extent item 1, found 0 backref 30425088 root 256 not referenced back 0x564a91c23d70 incorrect global backref count on 30425088 found 1 wanted 0 backpointer mismatch on [30425088 16384] owner ref check failed [30425088 16384] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups skipped (not enabled on this FS) found 212992 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124669 file data blocks allocated: 65536 referenced 65536 So fix this by freeing the metadata extent if btrfs_insert_root() returns an error. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-20 19:16:01 +02:00
Ingo Molnar	d0d252b8ca	Linux 5.12-rc8 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmB8qHweHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGEXIIAILUbsTJsNsvZIkZ uQ6SY6gnsPFkRiSRjY0YsZLUnqjTuiiHeTz4gzkonddwdnAp/9g6OIHIEBaeTqBh sTUMU/61Fgtrt/IvkA1yJ3rlawqgwdMe2VdimB+EFhufcSKq+5vpd3MVP4IuGx4E J3psoTU4gVltFs5t+1QjvI3XmByN0Qm8FMRXR7iL+zov1QTmGwR3G6Rn4AymG+QT pdruKboyZPfsrFGSVx7wd3HpFyQcrclEX9rKmBNZqets9d9JGWnqnEN4vQKmwO86 4MV29ucdMXH0AMB3kzGdVp0Ji2Ykt5W0K+MUWbFLtcSxnpu1OyBKGsEAMlRbD7ik gm0bMSw= =qHI0 -----END PGP SIGNATURE----- Merge tag 'v5.12-rc8' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-04-20 10:13:58 +02:00
J. Bruce Fields	aba2072f45	nfsd: grant read delegations to clients holding writes It's OK to grant a read delegation to a client that holds a write, as long as it's the only client holding the write. We originally tried to do this in commit `94415b06eb` ("nfsd4: a client's own opens needn't prevent delegations"), which had to be reverted in commit `6ee65a7730` ("Revert "nfsd4: a client's own opens needn't prevent delegations""). Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-04-19 16:41:36 -04:00
J. Bruce Fields	ebd9d2c2f5	nfsd: reshuffle some code No change in behavior, I'm just moving some code around to avoid forward references in a following patch. (To do someday: figure out how to split up nfs4state.c. It's big and disorganized.) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-04-19 16:41:36 -04:00
J. Bruce Fields	a0ce48375a	nfsd: track filehandle aliasing in nfs4_files It's unusual but possible for multiple filehandles to point to the same file. In that case, we may end up with multiple nfs4_files referencing the same inode. For delegation purposes it will turn out to be useful to flag those cases. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-04-19 16:41:36 -04:00
J. Bruce Fields	f9b60e2209	nfsd: hash nfs4_files by inode number The nfs4_file structure is per-filehandle, not per-inode, because the spec requires open and other state to be per filehandle. But it will turn out to be convenient for nfs4_files associated with the same inode to be hashed to the same bucket, so let's hash on the inode instead of the filehandle. Filehandle aliasing is rare, so that shouldn't have much performance impact. (If you have a ton of exported filesystems, though, and all of them have a root with inode number 2, could that get you an overlong hash chain? Perhaps this (and the v4 open file cache) should be hashed on the inode pointer instead.) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2021-04-19 16:41:33 -04:00
Qu Wenruo	1d8ba9e7e7	btrfs: handle remount to no compress during compression [BUG] When running btrfs/071 with inode_need_compress() removed from compress_file_range(), we got the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:compress_file_range+0x476/0x7b0 [btrfs] Call Trace: ? submit_compressed_extents+0x450/0x450 [btrfs] async_cow_start+0x16/0x40 [btrfs] btrfs_work_helper+0xf2/0x3e0 [btrfs] process_one_work+0x278/0x5e0 worker_thread+0x55/0x400 ? process_one_work+0x5e0/0x5e0 kthread+0x168/0x190 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x22/0x30 ---[ end trace 65faf4eae941fa7d ]--- This is already after the patch "btrfs: inode: fix NULL pointer dereference if inode doesn't need compression." [CAUSE] @pages is firstly created by kcalloc() in compress_file_extent(): pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); Then passed to btrfs_compress_pages() to be utilized there: ret = btrfs_compress_pages(... pages, &nr_pages, ...); btrfs_compress_pages() will initialize each page as output, in zlib_compress_pages() we have: pages[nr_pages] = out_page; nr_pages++; Normally this is completely fine, but there is a special case which is in btrfs_compress_pages() itself: switch (type) { default: return -E2BIG; } In this case, we didn't modify @pages nor @out_pages, leaving them untouched, then when we cleanup pages, the we can hit NULL pointer dereference again: if (pages) { for (i = 0; i < nr_pages; i++) { WARN_ON(pages[i]->mapping); put_page(pages[i]); } ... } Since pages[i] are all initialized to zero, and btrfs_compress_pages() doesn't change them at all, accessing pages[i]->mapping would lead to NULL pointer dereference. This is not possible for current kernel, as we check inode_need_compress() before doing pages allocation. But if we're going to remove that inode_need_compress() in compress_file_extent(), then it's going to be a problem. [FIX] When btrfs_compress_pages() hits its default case, modify @out_pages to 0 to prevent such problem from happening. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212331 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 21:32:45 +02:00
Pavel Begunkov	734551df6f	io_uring: fix shared sqpoll cancellation hangs [ 736.982891] INFO: task iou-sqp-4294:4295 blocked for more than 122 seconds. [ 736.982897] Call Trace: [ 736.982901] schedule+0x68/0xe0 [ 736.982903] io_uring_cancel_sqpoll+0xdb/0x110 [ 736.982908] io_sqpoll_cancel_cb+0x24/0x30 [ 736.982911] io_run_task_work_head+0x28/0x50 [ 736.982913] io_sq_thread+0x4e3/0x720 We call io_uring_cancel_sqpoll() one by one for each ctx either in sq_thread() itself or via task works, and it's intended to cancel all requests of a specified context. However the function uses per-task counters to track the number of inflight requests, so it counts more requests than available via currect io_uring ctx and goes to sleep for them to appear (e.g. from IRQ), that will never happen. Cancel a bit more than before, i.e. all ctxs that share sqpoll and continue to use shared counters. Don't forget that we should not remove ctx from the list before running that task_work sqpoll-cancel, otherwise the function wouldn't be able to find the context and will hang. Reported-by: Joakim Hassila <joj@mac.com> Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: `37d1e2e364` ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1bded7e6c6b32e0bae25fce36be2868e46b116a0.1618752958.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-19 11:34:32 -06:00
Pavel Begunkov	3b763ba1c7	io_uring: remove extra sqpoll submission halting SQPOLL task won't submit requests for a context that is currently dying, so no need to remove ctx from sqd_list prior the main loop of io_ring_exit_work(). Kill it, will be removed by io_sq_thread_finish() and only brings confusion and lockups. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f220c2b786ba0f9499bebc9f3cd9714d29efb6a5.1618752958.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-04-19 11:34:32 -06:00
Wan Jiabing	a7b4e506dc	f2fs: remove unnecessary struct declaration struct dnode_of_data is defined at 897th line. The declaration here is unnecessary. Remove it. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2021-04-19 09:36:14 -07:00
Johannes Thumshirn	1d68128c10	btrfs: zoned: fail mount if the device does not support zone append For zoned btrfs, zone append is mandatory to write to a sequential write only zone, otherwise parallel writes to the same zone could result in unaligned write errors. If a zoned block device does not support zone append (e.g. a dm-crypt zoned device using a non-NULL IV cypher), fail to mount. CC: stable@vger.kernel.org # 5.12 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:23 +02:00
Filipe Manana	061dde8245	btrfs: fix race between transaction aborts and fsyncs leading to use-after-free There is a race between a task aborting a transaction during a commit, a task doing an fsync and the transaction kthread, which leads to an use-after-free of the log root tree. When this happens, it results in a stack trace like the following: BTRFS info (device dm-0): forced readonly BTRFS warning (device dm-0): Skipping commit of aborted transaction. BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5) BTRFS warning (device dm-0): Skipping commit of aborted transaction. BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10 BTRFS error (device dm-0): error writing primary super block to device 1 BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10 BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers) BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:__mutex_lock+0x139/0xa40 Code: c0 74 19 (...) RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002 RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000 R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040 R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358 FS: 00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __btrfs_handle_fs_error+0xde/0x146 [btrfs] ? btrfs_sync_log+0x7c1/0xf20 [btrfs] ? btrfs_sync_log+0x7c1/0xf20 [btrfs] btrfs_sync_log+0x7c1/0xf20 [btrfs] btrfs_sync_file+0x40c/0x580 [btrfs] do_fsync+0x38/0x70 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa9142a55c3 Code: 8b 15 09 (...) RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3 RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005 RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340 R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0 Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...) ---[ end trace ee2f1b19327d791d ]--- The steps that lead to this crash are the following: 1) We are at transaction N; 2) We have two tasks with a transaction handle attached to transaction N. Task A and Task B. Task B is doing an fsync; 3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree into a local variable named 'log_root_tree' at the top of btrfs_sync_log(). Task B is about to call write_all_supers(), but before that... 4) Task A calls btrfs_commit_transaction(), and after it sets the transaction state to TRANS_STATE_COMMIT_START, an error happens before it waits for the transaction's 'num_writers' counter to reach a value of 1 (no one else attached to the transaction), so it jumps to the label "cleanup_transaction"; 5) Task A then calls cleanup_transaction(), where it aborts the transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state, setting the ->aborted field of the transaction and the handle to an errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state. After that, at cleanup_transaction(), it deletes the transaction from the list of transactions (fs_info->trans_list), sets the transaction to the state TRANS_STATE_COMMIT_DOING and then waits for the number of writers to go down to 1, as it's currently 2 (1 for task A and 1 for task B); 6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction(). There it sees the list fs_info->trans_list is empty, and then proceeds into calling btrfs_drop_all_logs(), which frees the log root tree with a call to btrfs_free_log_root_tree(); 7) Task B calls write_all_supers() and, shortly after, under the label 'out_wake_log_root', it deferences the pointer stored in 'log_root_tree', which was already freed in the previous step by the transaction kthread. This results in a use-after-free leading to a crash. Fix this by deleting the transaction from the list of transactions at cleanup_transaction() only after setting the transaction state to TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are attached to the transaction to release their transaction handles. This makes the transaction kthread wait for all the tasks attached to the transaction to be done with the transaction before dropping the log roots and doing other cleanups. Fixes: `ef67963dac` ("btrfs: drop logs when we've aborted a transaction") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Qu Wenruo	c4aec299fa	btrfs: introduce submit_eb_subpage() to submit a subpage metadata page The new function, submit_eb_subpage(), will submit all the dirty extent buffers in the page. The major difference between submit_eb_page() and submit_eb_subpage() is: - How to grab extent buffer Now we use find_extent_buffer_nospinlock() other than using page::private. All other different handling is already done in functions like lock_extent_buffer_for_io() and write_one_eb(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Qu Wenruo	f3156df944	btrfs: make lock_extent_buffer_for_io() to be subpage compatible For subpage metadata, we don't use page locking at all. So just skip the page locking part for subpage. The rest of the function can be reused. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Qu Wenruo	35b6ddfa96	btrfs: introduce write_one_subpage_eb() function The new function, write_one_subpage_eb(), as a subroutine for subpage metadata write, will handle the extent buffer bio submission. The major differences between the new write_one_subpage_eb() and write_one_eb() is: - No page locking When entering write_one_subpage_eb() the page is no longer locked. We only lock the page for its status update, and unlock immediately. Now we completely rely on extent io tree locking. - Extra bitmap update along with page status update Now page dirty and writeback is controlled by btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap. They both follow the schema that any sector is dirty/writeback, then the full page gets dirty/writeback. - When to update the nr_written number Now we take a shortcut, if we have cleared the last dirty bit of the page, we update nr_written. This is not completely perfect, but should emulate the old behavior well enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Qu Wenruo	2f3186d8ee	btrfs: introduce end_bio_subpage_eb_writepage() function The new function, end_bio_subpage_eb_writepage(), will handle the metadata writeback endio. The major differences involved are: - How to grab extent buffer Now page::private is a pointer to btrfs_subpage, we can no longer grab extent buffer directly. Thus we need to use the bv_offset to locate the extent buffer manually and iterate through the whole range. - Use btrfs_subpage_end_writeback() caller This helper will handle the subpage writeback for us. Since this function is executed under endio context, when grabbing extent buffers it can't grab eb->refs_lock as that lock is not designed to be grabbed under hardirq context. So here introduce a helper, find_extent_buffer_nolock(), for such situation, and convert find_extent_buffer() to use that helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	fb686c6824	btrfs: check return value of btrfs_commit_transaction in relocation There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	24213fa46c	btrfs: do proper error handling in merge_reloc_roots We have a BUG_ON() if we get an error back from btrfs_get_fs_root(). This honestly should never fail, as at this point we have a solid coordination of fs root to reloc root, and these roots will all be in memory. But in the name of killing BUG_ON()'s remove these and handle the error condition properly, ASSERT()'ing for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	8717cf440d	btrfs: handle extent corruption with select_one_root properly In corruption cases we could have paths from a block up to no root at all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by adding an ASSERT() for developers, and returning an error for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	e0b085b0b0	btrfs: cleanup error handling in prepare_to_merge This probably can't happen even with a corrupt file system, because we would have failed much earlier on than here. However there's no reason we can't just check and bail out as appropriate, so do that and convert the correctness BUG_ON() to an ASSERT(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	57a304cfd4	btrfs: do not panic in __add_reloc_root If we have a duplicate entry for a reloc root then we could have fs corruption that resulted in a double allocation. Since this shouldn't happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all of the callers of __add_reloc_root() to catch any logic mistakes for developers, otherwise normal error handling will happen for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	3c9258632c	btrfs: handle __add_reloc_root failures in btrfs_recover_relocation We can already handle errors appropriately from this function, deal with an error coming from __add_reloc_root appropriately. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:22 +02:00
Josef Bacik	790c1b8cd4	btrfs: do proper error handling in create_reloc_inode We already handle some errors in this function, and the callers do the correct error handling, so clean up the rest of the function to do the appropriate error handling. There's a little extra work that needs to be done here, as we create the inode item before we create the orphan item. We could potentially add the orphan item, but if we failed to create the inode item we would have to abort the transaction. Instead add a helper to delete the inode item we created in the case that we're unable to look up the inode (this would likely be caused by an ENOMEM), which if it succeeds means we can avoid a transaction abort in this particular error case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	24cd638902	btrfs: remove the extent item sanity checks in relocate_block_group These checks are all taken care of for us by the tree checker code: - the flags don't change or are updated consistently - the v0 extent item format is invalid and caught in many other places too Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	0ebb6bbbd4	btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly We need to validate that a data extent item does not have the FULL_BACKREF flag set on its flags. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	eb6b7fb4b5	btrfs: handle extent reference errors in do_relocation We can already deal with errors appropriately from do_relocation, simply handle any errors that come from changing the refs at this point cleanly. We have to abort the transaction if we fail here as we've modified metadata at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	253e258c34	btrfs: handle errors in reference count manipulation in replace_path If any of the reference count manipulation stuff fails in replace_path we need to abort the transaction, as we've modified the blocks already. We can simply break at this point and everything will be cleaned up. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	0e9873e2fe	btrfs: handle btrfs_search_slot failure in replace_path The search can fail for various reasons, in case of errors there's no cleanup to be done so we can pass the error to the caller, adjusting for the case where the key is not found and search slot returns 1. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	45b87c5d25	btrfs: handle btrfs_cow_block errors in replace_path If we error out COWing the root node when doing a replace_path then we simply unlock and free the buffer and return the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	7a9213a935	btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s A few BUG_ON()'s in replace_path are purely to keep us from making logical mistakes, so replace them with ASSERT()'s. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	592fbcd50c	btrfs: do proper error handling in btrfs_update_reloc_root We call btrfs_update_root in btrfs_update_reloc_root, which can fail for all sorts of reasons, including IO errors. Instead of panicing the box lets return the error, now that all callers properly handle those errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	bbae13f8ab	btrfs: handle btrfs_update_reloc_root failure in prepare_to_merge btrfs_update_reloc_root will will return errors in the future, so handle an error properly in prepare_to_merge. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	7934133fae	btrfs: handle btrfs_update_reloc_root failure in insert_dirty_subvol btrfs_update_reloc_root will will return errors in the future, so handle the error properly in insert_dirty_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	ac54da6c37	btrfs: change insert_dirty_subvol to return errors This will be able to return errors in the future, so change it to return an error and handle the errors appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:21 +02:00
Josef Bacik	2dd8298eb3	btrfs: handle btrfs_update_reloc_root failure in commit_fs_roots btrfs_update_reloc_root will will return errors in the future, so handle the error properly in commit_fs_roots. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	39200e5908	btrfs: validate root::reloc_root after recording root in trans If we fail to setup a root->reloc_root in a different thread that path will error out, however it still leaves root->reloc_root NULL but would still appear set up in the transaction. Subsequent calls to btrfs_record_root_in_transaction would succeed without attempting to create the reloc root, as the transid has already been updated. Handle this case by making sure we have a root->reloc_root set after a btrfs_record_root_in_transaction call so we don't end up dereferencing a NULL pointer. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	84c50ba521	btrfs: do proper error handling in create_reloc_root We do memory allocations here, read blocks from disk, all sorts of operations that could easily fail at any given point. Instead of panicing the box, simply return the error back up the chain, all callers at this point have proper error handling. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	00bb36a0e7	btrfs: have proper error handling in btrfs_init_reloc_root create_reloc_root will return errors in the future, and __add_reloc_root can return ENOMEM or EEXIST, so handle these errors properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	03a7e111a9	btrfs: return an error from btrfs_record_root_in_trans We can create a reloc root when we record the root in the trans, which can fail for all sorts of different reasons. Propagate this error up the chain of callers. Future patches will fix the callers of btrfs_record_root_in_trans() to handle the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	f0118cb6bc	btrfs: handle record_root_in_trans failure in create_pending_snapshot record_root_in_trans can currently fail, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	1409e6cc74	btrfs: handle record_root_in_trans failure in btrfs_record_root_in_trans record_root_in_trans can fail currently, handle this failure properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	1c442d2246	btrfs: handle record_root_in_trans failure in qgroup_account_snapshot record_root_in_trans can fail currently, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	68075ea8d7	btrfs: handle btrfs_record_root_in_trans failure in start_transaction btrfs_record_root_in_trans will return errors in the future, so handle the error properly in start_transaction. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	d18c7bd95c	btrfs: handle btrfs_record_root_in_trans failure in relocate_tree_block btrfs_record_root_in_trans will return errors in the future, so handle the error properly in relocate_tree_block. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	221581e485	btrfs: handle btrfs_record_root_in_trans failure in create_subvol btrfs_record_root_in_trans will return errors in the future, so handle the error properly in create_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	2002ae112a	btrfs: handle btrfs_record_root_in_trans failure in btrfs_recover_log_trees btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_recover_log_trees. This appears tricky, however we have a reference count on the destination root, so if this fails we need to continue on in the loop to make sure the proper cleanup is done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:20 +02:00
Josef Bacik	2731f5186b	btrfs: handle btrfs_record_root_in_trans failure in btrfs_delete_subvolume btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_delete_subvolume. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	b0fec6fd33	btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	00aa8e87c9	btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename_exchange btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename_exchange. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	404bccbcaa	btrfs: do proper error handling in record_reloc_root_in_trans Generally speaking this shouldn't ever fail, the corresponding fs root for the reloc root will already be in memory, so we won't get ENOMEM here. However if there is no corresponding root for the reloc root then we could get ENOMEM when we try to allocate it or we could get ENOENT when we look it up and see that it doesn't exist. Convert these BUG_ON()'s into ASSERT()'s and add proper error handling for the case of corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	92de551b83	btrfs: check record_root_in_trans related failures in select_reloc_root We will record the fs root or the reloc root in the trans in select_reloc_root. These will actually return errors in the following patches, so check their return value here and return it up the stack. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	8ee66afe99	btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors We have several BUG_ON()'s in select_reloc_root() that can be tripped if there is an extent tree corruption. Convert these to ASSERT()'s, because if we hit it during testing it really is bad, or could indicate a problem with the backref walking code. However if users hit these problems it generally indicates corruption, I've hit a few machines in the fleet that trip over these with clearly corrupted extent trees, so be nice and print out an error message and return an error instead of bringing the whole box down. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	cbdc2ebc7c	btrfs: handle errors from select_reloc_root() Currently select_reloc_root() doesn't return an error, but followup patches will make it possible for it to return an error. We do have proper error recovery in do_relocation however, so handle the possibility of select_reloc_root() having an error properly instead of BUG_ON(!root). I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we don't find a root, instead of NULL, to make the error case easier to deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this case as it indicates we messed up the backref walking code, but it could also indicate corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	1c7bfa159f	btrfs: convert BUG_ON()'s in relocate_tree_block We have a couple of BUG_ON()'s in relocate_tree_block() that can be tripped if we have file system corruption. Convert these to ASSERT()'s so developers still get yelled at when they break the backref code, but error out nicely for users so the whole box doesn't go down. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Josef Bacik	ffe30dd892	btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation A few of these are checking for correctness, and won't be triggered by corrupted file systems, so convert them to ASSERT() instead of BUG_ON() and add a comment explaining their existence. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Matthew Wilcox (Oracle)	32c0a6bcaa	btrfs: add and use readahead_batch_length Implement readahead_batch_length() to determine the number of bytes in the current batch of readahead pages and use it in btrfs. Also use the readahead_pos to get the offset. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Wan Jiabing	183ebab766	btrfs: move forward declarations to the beginning of extent_io.h There are two forward declarations deep in extent_io.h, move them to the beginning and remove the duplicate one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:19 +02:00
Qu Wenruo	894d137818	btrfs: subpage: add overview comments This patch adds an overview how btrfs subpage support works: - limitations - behavior - basic implementation points Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	5a2c60752a	btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible Current set_btree_ioerr() only accepts @page parameter and grabs extent buffer from page::private. This works fine for sector size == PAGE_SIZE case, but not for subpage case. Add an extra parameter, @eb, for callers to pass extent buffer to this function, so that subpage code can reuse this function. And also add subpage special handling to update btrfs_subpage::error_bitmap. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	0d27797e92	btrfs: make set/clear_extent_buffer_dirty() subpage compatible For set_extent_buffer_dirty() to support subpage sized metadata, just call btrfs_page_set_dirty() to handle both cases. For clear_extent_buffer_dirty(), it needs to clear the page dirty if and only if all extent buffers in the page range are no longer dirty. Also do the same for page error. This is pretty different from the existing clear_extent_buffer_dirty() routine, so add a new helper function, clear_subpage_extent_buffer_dirty() to do this for subpage metadata. Also since the main part of clearing page dirty code is still the same, extract that into btree_clear_page_dirty() so that it can be utilized for both cases. But there is a special race between set_extent_buffer_dirty() and clear_extent_buffer_dirty(), where we can clear the page dirty. [POSSIBLE RACE WINDOW] For the race window between clear_subpage_extent_buffer_dirty() and set_extent_buffer_dirty(), due to the fact that we can't call clear_page_dirty_for_io() under subpage spin lock, we can race like below: T1 (eb1 in the same page) \| T2 (eb2 in the same page) -------------------------------+------------------------------ set_extent_buffer_dirty() \| clear_extent_buffer_dirty() \|- was_dirty = false; \| \|- clear_subpagE_extent_buffer_dirty() \| \| \|- btrfs_clear_and_test_dirty() \| \| \| Since eb2 is the last dirty page \| \| \| we got: \| \| \| last == true; \| \| \| \|- btrfs_page_set_dirty() \| \| \| We set the page dirty and \| \| \| subpage dirty bitmap \| \| \| \| \|- if (last) \| \| \| Since we don't have subpage lock \| \| \| held, now @last is no longer \| \| \| correct \| \| \|- btree_clear_page_dirty() \| \| Now PageDirty == false, even if \| \| we have dirty_bitmap not zero. \|- ASSERT(PageDirty()); \| ^^^^ CRASH The solution here is to also lock the eb->pages[0] for subpage case of set_extent_buffer_dirty(), to prevent racing with clear_extent_buffer_dirty(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	b8f957715e	btrfs: support page uptodate assertions in subpage mode There are quite some assert checks on page uptodate in extent buffer write accessors. They ensure the destination page is already uptodate. This is fine for regular sector size case, but not for subpage case, as for subpage we only mark the page uptodate if the page contains no hole and all its extent buffers are uptodate. So instead of checking PageUptodate(), for subpage case we check the uptodate bitmap of btrfs_subpage structure. To make the check more elegant, introduce a helper, assert_eb_page_uptodate() to do the check for both subpage and regular sector size cases. The following functions are involved: - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer() - extent_buffer_test_bit() - extent_buffer_bitmap_set() - extent_buffer_bitmap_clear() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	1e5eb3d6a4	btrfs: make alloc_extent_buffer() check subpage dirty bitmap In alloc_extent_buffer(), we make sure that the newly allocated page is never dirty. This is fine for sector size == PAGE_SIZE case, but for subpage it's possible that one extent buffer in the page is dirty, thus the whole page is marked dirty, and could cause false alert. To support subpage, call btrfs_page_test_dirty() to handle both cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	eca0f6f643	btrfs: subpage: support metadata checksum calculation at write time Add a new helper, csum_dirty_subpage_buffers(), to iterate through all dirty extent buffers in one bvec. Also extract the code of calculating csum for one extent buffer into csum_one_extent_buffer(), so that both the existing csum_dirty_buffer() and the new csum_dirty_subpage_buffers() can reuse the same routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	139e8cd325	btrfs: subpage: do more sanity checks on metadata page dirtying For btree_set_page_dirty(), we should also check the extent buffer sanity for subpage support. Unlike the regular sector size case, since one page can contain multiple extent buffers, we need to make sure there is at least one dirty extent buffer in the page. So this patch will iterate through the btrfs_subpage::dirty_bitmap to get the extent buffers, and check if any dirty extent buffer in the page range has EXTENT_BUFFER_DIRTY and proper refs. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	3470da3b7d	btrfs: subpage: introduce helpers for writeback status Introduces the following functions to handle subpage writeback status: - btrfs_subpage_set_writeback() - btrfs_subpage_clear_writeback() - btrfs_subpage_test_writeback() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_writeback() - btrfs_page_clear_writeback() - btrfs_page_test_writeback() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	d8a5713e89	btrfs: subpage: introduce helpers for dirty status Introduce the following functions to handle subpage dirty status: - btrfs_subpage_set_dirty() - btrfs_subpage_clear_dirty() - btrfs_subpage_test_dirty() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_dirty() - btrfs_page_clear_dirty() - btrfs_page_test_dirty() These helpers can handle both regular sector size and subpage without problem. Thus they would be used to replace PageDirty() related calls in later patches. There is one special point to note here, just like set_page_dirty() and clear_page_dirty_for_io(), btrfs_page_set_dirty() and btrfs_page_clear_dirty() must be called with page locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	d239bcb83b	btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage() In btrfs_invalidatepage() we re-declare @tree variable as btrfs_ordered_inode_tree. Since it's only used to do the spinlock, we can grab it from inode directly, and remove the unnecessary declaration completely. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	ac5804eb85	btrfs: use min() to replace open-code in btrfs_invalidatepage() In btrfs_invalidatepage() we introduce a temporary variable, new_len, to update ordered->truncated_len. But we can use min() to replace it completely and no need for the variable. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Qu Wenruo	fc57ad8d33	btrfs: add sysfs interface for supported sectorsize Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes. Currently all architectures have PAGE_SIZE, There's some disparity between read-only and read-write support but that will be unified in the future so there's only one file exporting the size. The read-only support for systems with 64K pages also works for 4K sector size. This new sysfs interface would help eg. mkfs.btrfs to print more accurate warnings about potentially incompatible option combinations. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:18 +02:00
Filipe Manana	ace75066ce	btrfs: improve btree readahead for full send operations Currently a full send operation uses the standard btree readahead when iterating over the subvolume/snapshot btree, which despite bringing good performance benefits, it could be improved in a few aspects for use cases such as full send operations, which are guaranteed to visit every node and leaf of a btree, in ascending and sequential order. The limitations of that standard btree readahead implementation are the following: 1) It only triggers readahead for leaves that are physically close to the leaf being read, within a 64K range; 2) It only triggers readahead for the next or previous leaves if the leaf being read is not currently in memory; 3) It never triggers readahead for nodes. So add a new readahead mode that addresses all these points and use it for full send operations. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV \| egrep '^(node\|leaf) ' \| wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The durations of the full send operation in seconds were the following: Before this change: 217 seconds After this change: 205 seconds (-5.7%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	eafa4fd0ad	btrfs: fix exhaustion of the system chunk array due to concurrent allocations When we are running out of space for updating the chunk tree, that is, when we are low on available space in the system space info, if we have many task concurrently allocating block groups, via fallocate for example, many of them can end up all allocating new system chunks when only one is needed. In extreme cases this can lead to exhaustion of the system chunk array, which has a size limit of 2048 bytes, and results in a transaction abort with errno EFBIG, producing a trace in dmesg like the following, which was triggered on a PowerPC machine with a node/leaf size of 64K: [1359.518899] ------------[ cut here ]------------ [1359.518980] BTRFS: Transaction aborted (error -27) [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519152] Modules linked in: (...) [1359.519239] Supported: Yes, External [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3 [1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8 [1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default) [1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007 [1359.519356] CFAR: c00000000013e170 IRQMASK: 0 [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026 [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007 [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000 [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0 [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000 [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0 [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08 [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] [1359.519626] Call Trace: [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable) [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs] [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs] [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs] [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350 [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0 [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40 [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170 [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278 [1359.520037] Instruction dump: [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000 [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378 [1359.520122] ---[ end trace d6c186e151022e20 ]--- The following steps explain how we can end up in this situation: 1) Task A is at check_system_chunk(), either because it is allocating a new data or metadata block group, at btrfs_chunk_alloc(), or because it is removing a block group or turning a block group RO. It does not matter why; 2) Task A sees that there is not enough free space in the system space_info object, that is 'left' is < 'thresh'. And at this point the system space_info has a value of 0 for its 'bytes_may_use' counter; 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate a new system block group (chunk) and then reserves 'thresh' bytes in the chunk block reserve with the call to btrfs_block_rsv_add(). This changes the chunk block reserve's 'reserved' and 'size' counters by an amount of 'thresh', and changes the 'bytes_may_use' counter of the system space_info object from 0 to 'thresh'. Also during its call to btrfs_alloc_chunk(), we end up increasing the value of the 'total_bytes' counter of the system space_info object by 8MiB (the size of a system chunk stripe). This happens through the call chain: btrfs_alloc_chunk() create_chunk() btrfs_make_block_group() btrfs_update_space_info() 4) After it finishes the first phase of the block group allocation, at btrfs_chunk_alloc(), task A unlocks the chunk mutex; 5) At this point the new system block group was added to the transaction handle's list of new block groups, but its block group item, device items and chunk item were not yet inserted in the extent, device and chunk trees, respectively. That only happens later when we call btrfs_finish_chunk_alloc() through a call to btrfs_create_pending_block_groups(); Note that only when we update the chunk tree, through the call to btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter of the chunk block reserve as we COW/allocate extent buffers, through: btrfs_alloc_tree_block() btrfs_use_block_rsv() btrfs_block_rsv_use_bytes() And the system space_info's 'bytes_may_use' is decremented everytime we allocate an extent buffer for COW operations on the chunk tree, through: btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() btrfs_add_reserved_bytes() If we end up COWing less chunk btree nodes/leaves than expected, which is the typical case since the amount of space we reserve is always pessimistic to account for the worst possible case, we release the unused space through: btrfs_create_pending_block_groups() btrfs_trans_release_chunk_metadata() btrfs_block_rsv_release() block_rsv_release_bytes() btrfs_space_info_free_bytes_may_use() But before task A gets into btrfs_create_pending_block_groups()... 6) Many other tasks start allocating new block groups through fallocate, each one does the first phase of block group allocation in a serialized way, since btrfs_chunk_alloc() takes the chunk mutex before calling check_system_chunk() and btrfs_alloc_chunk(). However before everyone enters the final phase of the block group allocation, that is, before calling btrfs_create_pending_block_groups(), new tasks keep coming to allocate new block groups and while at check_system_chunk(), the system space_info's 'bytes_may_use' keeps increasing each time a task reserves space in the chunk block reserve. This means that eventually some other task can end up not seeing enough free space in the system space_info and decide to allocate yet another system chunk. This may repeat several times if yet more new tasks keep allocating new block groups before task A, and all the other tasks, finish the creation of the pending block groups, which is when reserved space in excess is released. Eventually this can result in exhaustion of system chunk array in the superblock, with btrfs_add_system_chunk() returning EFBIG, resulting later in a transaction abort. Even when we don't reach the extreme case of exhausting the system array, most, if not all, unnecessarily created system block groups end up being unused since when finishing creation of the first pending system block group, the creation of the following ones end up not needing to COW nodes/leaves of the chunk tree, so we never allocate and deallocate from them, resulting in them never being added to the list of unused block groups - as a consequence they don't get deleted by the cleaner kthread - the only exceptions are if we unmount and mount the filesystem again, which adds any unused block groups to the list of unused block groups, if a scrub is run, which also adds unused block groups to the unused list, and under some circumstances when using a zoned filesystem or async discard, which may also add unused block groups to the unused list. So fix this by: ) Tracking the number of reserved bytes for the chunk tree per transaction, which is the sum of reserved chunk bytes by each transaction handle currently being used; ) When there is not enough free space in the system space_info, if there are other transaction handles which reserved chunk space, wait for some of them to complete in order to have enough excess reserved space released, and then try again. Otherwise proceed with the creation of a new system chunk. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	b7a7a83463	btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a file that has the S_SYNC attribute set, we totally ignore that and do not durably persist the reflink changes. Since a reflink can change the data readable from a file (and mtime/ctime, or a file size), it makes sense to durably persist (fsync) the source and destination files/ranges. This was previously discussed at: https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/ The recently introduced test case generic/628, from fstests, exercises these scenarios and currently fails without this change. So make sure we fsync the source and destination files/ranges when either of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set, just like XFS already does. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Arnd Bergmann	bb05b298af	btrfs: zoned: bail out in btrfs_alloc_chunk for bad input gcc complains that the ctl->max_chunk_size member might be used uninitialized when none of the three conditions for initializing it in init_alloc_chunk_ctl_policy_zoned() are true: In function ‘init_alloc_chunk_ctl_policy_zoned’, inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3, inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2: include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized] 4998 \| ctl->max_chunk_size = min(limit, ctl->max_chunk_size); \| ^~~ fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’: fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here 5316 \| struct alloc_chunk_ctl ctl; \| ^~~ If we ever get into this condition, something is seriously wrong, as validity is checked in the callers btrfs_alloc_chunk init_alloc_chunk_ctl init_alloc_chunk_ctl_policy_zoned so the same logic as in init_alloc_chunk_ctl_policy_regular() and a few other places should be applied. This avoids both further data corruption, and the compile-time warning. Fixes: `1cd6121f2a` ("btrfs: zoned: implement zoned chunk allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
BingJing Chang	3227788cd3	btrfs: fix a potential hole punching failure In commit `d77815461f` ("btrfs: Avoid trucating page or punching hole in a already existed hole."), existing holes can be skipped by calling find_first_non_hole() to adjust start and len. However, if the given len is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will not be set to zero because (em->start + em->len) is less than (start + len). Then the ret will be 1 but len will not be set to 0. The propagated non-zero ret will result in fallocate failure. In the while-loop of btrfs_replace_file_extents(), len is not updated every time before it calls find_first_non_hole(). That is, after btrfs_drop_extents() successfully drops the last non-hole file extent, it may fail with ENOSPC when attempting to drop a file extent item representing a hole. The problem can happen. After it calls find_first_non_hole(), the cur_offset will be adjusted to be larger than or equal to end. However, since the len is not set to zero, the break-loop condition (ret && !len) will not be met. After it leaves the while-loop, fallocate will return 1, which is an unexpected return value. We're not able to construct a reproducible way to let btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole file extent but with remaining holes left. However, it's quite easy to fix. We just need to update and check the len every time before we call find_first_non_hole(). To make the while loop more readable, we also pull the variable updates to the bottom of loop like this: while (cur_offset < end) { ... // update cur_offset & len // advance cur_offset & len in hole-punching case if needed } Reported-by: Robbie Ko <robbieko@synology.com> Fixes: `d77815461f` ("btrfs: Avoid trucating page or punching hole in a already existed hole.") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Naohiro Aota	e75f9fd194	btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex Commit `6e37d24599` ("btrfs: zoned: fix deadlock on log sync") pointed out a deadlock warning and removed mutex_{lock,unlock} of fs_info::tree_root->log_mutex. While it looks like it always cause a deadlock, we didn't see actual deadlock in fstests runs. The reason is log_root_tree->log_mutex != fs_info->tree_root->log_mutex, not taking the same lock. So, the warning was actually a false-positive. Since btrfs_alloc_log_tree_node() is protected only by fs_info->tree_root->log_mutex, we can (and should) move the code out of the lock scope of log_root_tree->log_mutex and silence the warning. Fixes: `6e37d24599` ("btrfs: zoned: fix deadlock on log sync") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Josef Bacik	2cdb3909c9	btrfs: use percpu_read_positive instead of sum_positive for need_preempt Looking at perf data for a fio workload I noticed that we were spending a pretty large chunk of time (around 5%) doing percpu_counter_sum() in need_preemptive_reclaim. This is silly, as we only want to know if we have more ordered than delalloc to see if we should be counting the delayed items in our threshold calculation. Change this to percpu_read_positive() to avoid the overhead. I ran this through fsperf to validate the changes, obviously the latency numbers in dbench and fio are quite jittery, so take them as you wish, but overall the improvements on throughput, iops, and bw are all positive. Each test was run two times, the given value is the average of both runs for their respective column. btrfs ssd normal test results bufferedrandwrite16g results metric baseline current diff ========================================================== write_io_kbytes 16777216 16777216 0.00% read_clat_ns_p99 0 0 0.00% write_bw_bytes 1.04e+08 1.05e+08 1.12% read_iops 0 0 0.00% write_clat_ns_p50 13888 11840 -14.75% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 35008 29312 -16.27% read_bw_bytes 0 0 0.00% elapsed 170 167 -1.76% write_lat_ns_min 4221.50 3762.50 -10.87% sys_cpu 39.65 35.37 -10.79% write_lat_ns_max 2.67e+10 2.50e+10 -6.63% read_lat_ns_min 0 0 0.00% write_iops 25270.10 25553.43 1.12% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% dbench60 results metric baseline current diff ================================================== qpathinfo 11.12 12.73 14.52% throughput 416.09 445.66 7.11% flush 3485.63 1887.55 -45.85% qfileinfo 0.70 1.92 173.86% ntcreatex 992.60 695.76 -29.91% qfsinfo 2.43 3.71 52.48% close 1.67 3.14 88.09% sfileinfo 66.54 105.20 58.10% rename 809.23 619.59 -23.43% find 16.88 15.46 -8.41% unlink 820.54 670.86 -18.24% writex 3375.20 2637.91 -21.84% deltree 386.33 449.98 16.48% readx 3.43 3.41 -0.60% mkdir 0.05 0.03 -38.46% lockx 0.26 0.26 -0.76% unlockx 0.81 0.32 -60.33% dio4kbs16threads results metric baseline current diff ================================================================ write_io_kbytes 5249676 3357150 -36.05% read_clat_ns_p99 0 0 0.00% write_bw_bytes 89583501.50 57291192.50 -36.05% read_iops 0 0 0.00% write_clat_ns_p50 242688 263680 8.65% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 15826944 36732928 132.09% read_bw_bytes 0 0 0.00% elapsed 61 61 0.00% write_lat_ns_min 42704 42095 -1.43% sys_cpu 5.27 3.45 -34.52% write_lat_ns_max 7.43e+08 9.27e+08 24.71% read_lat_ns_min 0 0 0.00% write_iops 21870.97 13987.11 -36.05% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% randwrite2xram results metric baseline current diff ================================================================ write_io_kbytes 24831972 28876262 16.29% read_clat_ns_p99 0 0 0.00% write_bw_bytes 83745273.50 92182192.50 10.07% read_iops 0 0 0.00% write_clat_ns_p50 13952 11648 -16.51% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 50176 52992 5.61% read_bw_bytes 0 0 0.00% elapsed 314 332 5.73% write_lat_ns_min 5920.50 5127 -13.40% sys_cpu 7.82 7.35 -6.07% write_lat_ns_max 5.27e+10 3.88e+10 -26.44% read_lat_ns_min 0 0 0.00% write_iops 20445.62 22505.42 10.07% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% untarfirefox results metric baseline current diff ============================================== elapsed 47.41 47.40 -0.03% Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	e2b84217f3	btrfs: update outdated comment at btrfs_replace_file_extents() There is a comment at btrfs_replace_file_extents() that mentions that we set the full sync flag on an inode when cloning into a file with a size greater than or equals to 16MiB, through try_release_extent_mapping() when we truncate the page cache after replacing file extents during a clone operation. That is not true anymore since commit `5e548b3201` ("btrfs: do not set the full sync flag on the inode during page release"), so update the comment to remove that part and rephrase it slightly to make it more clear why the full sync flag is set at btrfs_replace_file_extents(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	0c0218e9a6	btrfs: update outdated comment at btrfs_orphan_cleanup() btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but function does not exists since commit `cb517eabba` ("Btrfs: cleanup the similar code of the fs root read"). What we use now to find and load dead roots is btrfs_find_orphan_roots(). So update the comment and make it a bit more detailed about why we can not delete an orphan item for a root. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	ffbc10a144	btrfs: update debug message when checking seq number of a delayed ref We used to encode two different numbers in the tree mod log counter used for sequence numbers, one in the upper 32 bits and the other one in the lower 32 bits. However that is no longer the case, we stopped doing that since commit `fcebe4562d` ("Btrfs: rework qgroup accounting"). So update the debug message at btrfs_check_delayed_seq to stop extracting the two 32 bits counters and print instead the 64 bits sequence numbers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	4bae788075	btrfs: add and use helper to get lowest sequence number for the tree mod log There are two places outside the tree mod log module that extract the lowest sequence number of the tree mod log. These places end up duplicating code and open coding the logic and internal implementation details of the tree mod log. So add a helper to the tree mod log module and header that returns the lowest sequence number or 0 if there aren't any tree mod log users at the moment. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	ffe1d039d7	btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf, and if so, return immediately and do nothing. However this check can be removed, because after it we call tree_mod_need_log(), which returns false when given an extent buffer that corresponds to a leaf. So just remove the leaf check and pass the extent buffer to tree_mod_need_log(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:17 +02:00
Filipe Manana	888dd18339	btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() Instead of exposing implementation details of the tree mod log to check if there are active tree mod log users at btrfs_free_tree_block(), use the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This way extent-tree.c does not need to known about any of the internals of the tree mod log and avoids taking a lock unnecessarily as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	bc03f39ec3	btrfs: use a bit to track the existence of tree mod log users The tree modification log functions are called very frequently, basically they are called every time a btree is modified (a pointer added or removed to a node, a new root for a btree is set, etc). Because of that, to avoid heavy lock contention on the lock that protects the list of tree mod log users, we have checks that test the emptiness of the list with a full memory barrier before the checks, so that when there are no tree mod log users we avoid taking the lock. Replace the memory barrier and list emptiness check with a test for a new bit set at fs_info->flags. This bit is used to indicate when there are tree mod log users, set whenever a user is added to the list and cleared when the last user is removed from the list. This makes the intention a bit more obvious and possibly more efficient (assuming test_bit() may be cheaper than a full memory barrier on some architectures). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	406808ab2f	btrfs: use booleans where appropriate for the tree mod log functions Several functions of the tree modification log use integers as booleans, so change them to use booleans instead, making their use more clear. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	f3a84ccd28	btrfs: move the tree mod log code into its own file The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Ira Weiny	9a002d531b	btrfs: integrity-checker: convert block context kmap's to kmap_local_page btrfsic_read_block() (which calls kmap()) and btrfsic_release_block_ctx() (which calls kunmap()) are always called within a single thread of execution. Therefore the mappings created within these calls can be a thread local mapping. Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the unmap loops backwards through the array pointer so no adjustment needs to be made to the unmapping order. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Ira Weiny	3e037efdbd	btrfs: integrity-checker: use kmap_local_page in __btrfsic_submit_bio Again there is an array of pointers which must be unmapped in the correct order. Convert the kmap()'s to kmap_local_page() and adjust the unmapping to work backwards through the unmapping loop. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Ira Weiny	94a0b58d2d	btrfs: raid56: convert kmaps to kmap_local_page These kmaps are thread local and don't need to be atomic. So they can use the more efficient kmap_local_page(). However, the mapping of pages in the stripes and the additional parity and qstripe pages are a bit trickier because the unmapping must occur in the opposite order from the mapping. Furthermore, the pointer array in __raid_recover_end_io() may get reordered. Convert these calls to kmap_local_page() taking care to reverse the unmappings of any page arrays as well as being careful with the mappings of any special pages such as the parity and qstripe pages. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Ira Weiny	58c1a35cd5	btrfs: convert kmap to kmap_local_page, simple cases Use a simple coccinelle script to help convert the most common kmap()/kunmap() patterns to kmap_local_page()/kunmap_local(). Note that some kmaps which were caught by this script needed to be handled by hand because of the strict unmapping order of kunmap_local() so they are not included in this patch. But this script got us started. There's another temp variable added for the final length write to the first page so it does not interfere with cpage_out that is used for mapping other pages. The development of this patch was aided by the follow script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap and replace with kmap_local_page then mark kunmap // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ @ catch_all @ expression e, e2; @@ ( -kmap(e) +kmap_local_page(e) ) ... ( -kunmap(...) +kunmap_local() ) // </smpl> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Johannes Thumshirn	cea628008f	btrfs: remove duplicated in_range() macro The in_range() macro is defined twice in btrfs' source, once in ctree.h and once in misc.h. Remove the definition in ctree.h and include misc.h in the files depending on it. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	209ecbb858	btrfs: remove stale comment and logic from btrfs_inode_in_log() Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why, as it used to be necessary to make sure if we did something like the following: mmap write range A mmap write range B msync range A (ranged fsync) msync range B (ranged fsync) we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing because btrfs_inode_in_log() would return true. This was added in `125c4cf9f3` ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and test case generic/325 from fstests exercises that scenario. However, as of commit `487781796d` ("btrfs: make fast fsyncs wait only for writeback"), every ranged fsync is now turned into a full ranged fsync (operates on the range from 0 to LLONG_MAX), so it is now pointless to test of emptiness of the list of modified extents, and the comment is clearly outdated. So just remove the comment and list emptiness check, while also changing the function's return type to be a boolean instead of an integer. In case one day we get support for ranged fsyncs again, it will be easy to notice the check is necessary again, because it will make generic/325 always fail. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	bc0939fcfa	btrfs: fix race between marking inode needs to be logged and log syncing We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:16 +02:00
Filipe Manana	885f46d87f	btrfs: fix race between memory mapped writes and fsync When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush any new delalloc that might have been created before taking the lock and then wait either for the ordered extents to complete or just for the writeback to complete (depending on whether the full sync flag is set or not). We then start logging the inode and assume that while we are doing it no one else is touching the inode's file extent items (or adding new ones). That is generally true because all operations that modify an inode acquire the inode's lock first, including buffered and direct IO writes. However there is one exception: memory mapped writes, which do not and can not acquire the inode's lock. This can cause two types of issues: ending up logging file extent items with overlapping ranges, which is detected by the tree checker and will result in aborting the transaction when starting writeback for a log tree's extent buffers, or a silent corruption where we log a version of the file that never existed. Scenario 1 - logging overlapping extents The following steps explain how we can end up with file extents items with overlapping ranges in a log tree due to a race between a fsync and memory mapped writes: 1) Task A starts an fsync on inode X, which has the full sync runtime flag set. First it starts by flushing all delalloc for the inode; 2) Task A then locks the inode and flushes any other delalloc that might have been created after the previous flush and waits for all ordered extents to complete; 3) In the inode's root we have the following leaf: Leaf N, generation == current transaction id: --------------------------------------------------------- \| (...) [ file extent item, offset 640K, length 128K ] \| --------------------------------------------------------- The last file extent item in leaf N covers the file range from 640K to 768K; 4) Task B does a memory mapped write for the page corresponding to the file range from 764K to 768K; 5) Task A starts logging the inode. At copy_inode_items_to_log() it uses btrfs_search_forward() to search for leafs modified in the current transaction that contain items for the inode. It finds leaf N and copies all the inode items from that leaf into the log tree. Now the log tree has a copy of the last file extent item from leaf N. At the end of the while loop at copy_inode_items_to_log(), we have the minimum key set to: min_key.objectid = <inode X number> min_key.type = BTRFS_EXTENT_DATA_KEY min_key.offset = 640K Then we increment the key's offset by 1 so that the next call to btrfs_search_forward() leaves us at the first key greater than the key we just processed. But before btrfs_search_forward() is called again... 6) Dellaloc for the page at offset 764K, dirtied by task B, is started. It can be started for several reasons: - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc; - Due to memory pressure the VMM triggers writeback of dirty pages; - The system call sync_file_range(2) is called from user space. 7) When the respective ordered extent completes, it trims the length of the existing file extent item for file offset 640K from 128K to 124K, and a new file extent item is added with a key offset of 764K and a length of 4K; 8) Task A calls btrfs_search_forward(), which returns us a path pointing to the leaf (can be leaf N or some other) containing the new file extent item for file offset 764K. We end up copying this item to the log tree, which overlaps with the last copied file extent item, which covers the file range from 640K to 768K. When writeback is triggered for log tree's extent buffers, the issue will be detected by the tree checker which will dump a trace and an error message on dmesg/syslog. If the writeback is triggered when syncing the log, which typically is, then we also end up aborting the current transaction. This is the same type of problem fixed in `0c713cbab6` ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges"). Scenario 2 - logging a version of the file that never existed This scenario only happens when using the NO_HOLES feature and results in a silent corruption, in the sense that is not detectable by 'btrfs check' or the tree checker: 1) We have an inode I with a size of 1M and two file extent items, one covering an extent with disk_bytenr == X for the file range [0, 512K) and another one covering another extent with disk_bytenr == Y for the file range [512K, 1M); 2) A hole is punched for the file range [512K, 1M); 3) Task A starts an fsync of inode I, which has the full sync runtime flag set. It starts by flushing all existing delalloc, locks the inode (VFS lock), starts any new delalloc that might have been created before taking the lock and waits for all ordered extents to complete; 4) Some other task does a memory mapped write for the page corresponding to the file range [640K, 644K) for example; 5) Task A then logs all items of the inode with the call to copy_inode_items_to_log(); 6) In the meanwhile delalloc for the range [640K, 644K) is started. It can be started for several reasons: - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc; - Due to memory pressure the VMM triggers writeback of dirty pages; - The system call sync_file_range(2) is called from user space. 7) The ordered extent for the range [640K, 644K) completes and a file extent item for that range is added to the subvolume tree, pointing to a 4K extent with a disk_bytenr == Z; 8) Task A then calls btrfs_log_holes(), to scan for implicit holes in the subvolume tree. It finds two implicit holes: - one for the file range [512K, 640K) - one for the file range [644K, 1M) As a result we end up neither logging a hole for the range [640K, 644K) nor logging the file extent item with a disk_bytenr == Z. This means that if we have a power failure and replay the log tree we end up getting the following file extent layout: [ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ] 0 512K 512K 640K 640K 644K 644K 1M Which does not corresponding to any layout the file ever had before the power failure. The only two valid layouts would be: [ disk_bytenr X ] [ hole ] 0 512K 512K 1M and [ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ] 0 512K 512K 640K 640K 644K 644K 1M This can be fixed by serializing memory mapped writes with fsync, and there are two ways to do it: 1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX in the inode's io tree. This prevents the race but also blocks any reads during the duration of the fsync, which has a negative impact for many common workloads; 2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This semaphore was recently added by Josef's patch set: btrfs: add a i_mmap_lock to our inode btrfs: cleanup inode_lock/inode_unlock uses btrfs: exclude mmaps while doing remap btrfs: exclude mmap from happening during all fallocate operations and is used to solve races between memory mapped writes and clone/dedupe/fallocate. This also makes us have the same behaviour we have regarding other writes (buffered and direct IO) and fsync - block them while the inode logging is in progress. This change uses the second approach due to the performance impact of the first one. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Josef Bacik	8d9b4a162a	btrfs: exclude mmap from happening during all fallocate operations There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe: """ When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range. This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens: 1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete; 2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode; 3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write); 4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already); 5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flush_workers workqueue) to flush delalloc for that inode and waits on that job to complete; 6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task; 7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket. The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task. So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks. When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several tasks waiting for the inode lock held by the fallocate task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? btrfs_fallocate+0xdcf/0x1260 [btrfs] btrfs_fallocate+0xdfb/0x1260 [btrfs] ? filename_lookup+0xf1/0x180 vfs_fallocate+0x14f/0x440 ioctl_preallocate+0x92/0xc0 do_vfs_ioctl+0x66b/0x750 ? __do_sys_newfstat+0x53/0x60 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Josef Bacik	8c99516a8c	btrfs: exclude mmaps while doing remap Darrick reported a potential issue to me where we could allow mmap writes after validating a page range matched in the case of dedupe. Generally we rely on lock page -> lock extent with the ordered flush to protect us, but this is done after we check the pages because we use the generic helpers, so we could modify the page in between doing the check and locking the range. There also exists a deadlock, as described by Filipe """ When cloning a file range, we lock the inodes, flush any delalloc within the respective file ranges, wait for any ordered extents and then lock the file ranges in both inodes. This means that right after we flush delalloc and before we lock the file ranges, memory mapped writes can come in and dirty pages in the file ranges of the clone operation. Most of the time this is harmless and causes no problems. However, if we are low on available metadata space, we can later end up in a deadlock when starting a transaction to replace file extent items. This happens if when allocating metadata space for the transaction, we need to wait for the async reclaim thread to release space and the reclaim thread needs to flush delalloc for the inode that got the memory mapped write and has its range locked by the clone task. Basically what happens is the following: 1) A clone operation locks inodes A and B, flushes delalloc for both inodes in the respective file ranges and waits for any ordered extents in those ranges to complete; 2) Before the clone task locks the file ranges, another task does a memory mapped write (which does not lock the inode) for one of the inodes of the clone operation. So now we have a dirty page in one of the ranges used by the clone operation; 3) The clone operation locks the file ranges for inodes A and B; 4) Later, when iterating over the file extents of inode A, the clone task attempts to start a transaction. There's not enough available free metadata space, so the async reclaim task is started (if not running already) and we wait for someone to wake us up on our reservation ticket; 5) The async reclaim task is not able to release space by any other means and decides to flush delalloc for the inode of the clone operation; 6) The workqueue job used to flush the inode blocks when starting delalloc for the inode, since the file range is currently locked by the clone task; 7) But the clone task is waiting on its reservation ticket and the async reclaim task is waiting on the flush job to complete, which can't progress since the clone task has the file range locked. So unless some other task is able to release space, for example an ordered extent for some other inode completes, we have a deadlock between all these tasks; When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several other tasks blocked on inode locks held by the clone task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? lock_release+0x20e/0x4c0 btrfs_clone+0x3e4/0x7e0 [btrfs] ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs] btrfs_clone_files+0xf6/0x150 [btrfs] btrfs_remap_file_range+0x324/0x3d0 [btrfs] do_clone_file_range+0xd4/0x1f0 vfs_clone_file_range+0x4d/0x230 ? lock_release+0x20e/0x4c0 ioctl_file_clone+0x8f/0xc0 do_vfs_ioctl+0x342/0x750 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix both of these issues by excluding mmaps from happening we are doing any sort of remap, which prevents this race completely. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Josef Bacik	64708539cd	btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Josef Bacik	8318ba79ee	btrfs: add a i_mmap_lock to our inode We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Goldwyn Rodrigues	5e295768a0	btrfs: remove mirror argument from btrfs_csum_verify_data() The parameter mirror is not used and does not make sense for checksum verification of the given bio. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Goldwyn Rodrigues	6e65ae7629	btrfs: remove force argument from run_delalloc_nocow() force_cow can be calculated from inode and does not need to be passed as an argument. This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range() A new function, should_nocow() checks if the range should be NOCOWed or not. The function returns true iff either BTRFS_INODE_NODATA or BTRFS_INODE_PREALLOC, but is not a defrag extent. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Nikolay Borisov	d6ade6894e	btrfs: don't opencode extent_changeset_free Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Jiapeng Chong	7000babdda	btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned Fix the following coccicheck warnings: ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function 'dev_extent_hole_check_zoned' with return type bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Filipe Manana	2ce73c6335	btrfs: add btree read ahead for incremental send operations Currently we do not do btree read ahead when doing an incremental send, however we know that we will read and process any node or leaf in the send root that has a generation greater than the generation of the parent root. So triggering read ahead for such nodes and leafs is beneficial for an incremental send. This change does that, triggers read ahead of any node or leaf in the send root that has a generation greater then the generation of the parent root. As for the parent root, no readahead is triggered because knowing in advance which nodes/leaves are going to be read is not so linear and there's often a large time window between visiting nodes or leaves of the parent root. So I opted to leave out the parent root, and triggering read ahead for its nodes/leaves seemed to have not made significant difference. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of ram: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV \| egrep '^(node\|leaf) ' \| wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, incremental send duration: with $initial_file_count == 200000: 51 seconds with $initial_file_count == 500000: 168 seconds After this change, incremental send duration: with $initial_file_count == 200000: 39 seconds (-26.7%) with $initial_file_count == 500000: 125 seconds (-29.4%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, and 77759 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2. While for $initial_file_count == 500000 there are 152476 nodes and leaves in the btree of the first snapshot, and 190511 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2 as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Filipe Manana	19358b154f	btrfs: add btree read ahead for full send operations When doing a full send we know that we are going to be reading every node and leaf of the send root, so we benefit from enabling read ahead for the btree. This change enables read ahead for full send operations only, incremental sends will have read ahead enabled in a different way by a separate patch. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV \| egrep '^(node\|leaf) ' \| wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, full send duration: with $initial_file_count == 200000: 165 seconds with $initial_file_count == 500000: 407 seconds After this change, full send duration: with $initial_file_count == 200000: 149 seconds (-10.2%) with $initial_file_count == 500000: 353 seconds (-14.2%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, while for $initial_file_count == 500000 there are 152476 nodes and leaves. The roots were at level 2. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Nikolay Borisov	98686ffc71	btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata btrfs_block_rsv_add can return only ENOSPC since it's called with NO_FLUSH modifier. This so simplify the logic in btrfs_delayed_inode_reserve_metadata to exploit this invariant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:15 +02:00
Nikolay Borisov	8e3c9d3cf8	btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadata It's only used for tracepoint to obtain the inode number, but we already have the ino from btrfs_delayed_node::inode_id. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:14 +02:00
Nikolay Borisov	ae396a3b7a	btrfs: simplify commit logic in try_flush_qgroup It's no longer expected to call this function with an open transaction so all the workarounds concerning this can be removed. In fact it'll constitute a bug to call this function with a transaction already held so WARN in this case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2021-04-19 17:25:14 +02:00

... 3 4 5 6 7 ...

70705 Commits