linux-sg2042

Commit Graph

Author	SHA1	Message	Date
David Sterba	1db45a35f0	btrfs: replace u_long type cast with unsigned long We don't use the u_XX types anywhere, though they're defined. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:45 +01:00
David Sterba	eeb6f17200	btrfs: raid56: simplify sort_parity_stripes Remove trivial comparator and open coded swap of two values. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:45 +01:00
David Sterba	7e8f19e50e	btrfs: adjust message level for unrecognized mount option An unrecognized option is a failure that should get user/administrator attention, the info level is often below what gets logged, so make it error. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:45 +01:00
David Sterba	42c9d0b524	btrfs: simplify parameters of btrfs_set_disk_extent_flags All callers pass extent buffer start and length so the extent buffer itself should work fine. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:45 +01:00
David Sterba	c4ac754198	btrfs: open code trivial helper btrfs_header_chunk_tree_uuid The helper btrfs_header_chunk_tree_uuid follows naming convention of other struct accessors but does something completely different. As the offsetof calculation is clear in the context of extent buffer operations we can remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
David Sterba	9a8658e33d	btrfs: open code trivial helper btrfs_header_fsid The helper btrfs_header_fsid follows naming convention of other struct accessors but does something completely different. As the offsetof calculation is clear in the context of extent buffer operations we can remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
David Sterba	75fb2e9e49	btrfs: move mapping of block for discard to its caller There's a simple forwarded call based on the operation that would better fit the caller btrfs_map_block that's until now a trivial wrapper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
David Sterba	ee787f9550	btrfs: use struct_size to calculate size of raid hash table The struct_size macro does the same calculation and is safe regarding overflows. Though we're not expecting them to happen, use the helper for clarity. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
Nikolay Borisov	dcc3eb9638	btrfs: convert snapshot/nocow exclusion to drew lock This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drew lock to ensure this invariant still holds. 'Readers' are snapshot creators from create_snapshot and 'writers' are nocow writers from buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots and also favor readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu's writers counter to avoid cacheline contention. The main gain from using the drew lock is it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:44 +01:00
Nikolay Borisov	2992df7326	btrfs: Implement DREW lock A (D)ouble (R)eader (W)riter (E)xclusion lock is a locking primitive that allows multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. Current implementation actually favors Readers (that is snapshot creators) over writers (nocow writers of the filesystem). The API provides lock/unlock/trylock for reads and writes. Formal specification for TLA+ provided by Valentin Schneider is at https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:43 +01:00
Johannes Thumshirn	fd8efa818c	btrfs: simplify error handling in __btrfs_write_out_cache() The error cleanup gotos in __btrfs_write_out_cache() needlessly jump back making the code less readable than needed. Flatten them out so no back-jump is necessary and the read flow is uninterrupted. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:43 +01:00
Johannes Thumshirn	1afb648e94	btrfs: use standard debug config option to enable free-space-cache debug prints free-space-cache.c has its own set of DEBUG ifdefs which need to be turned on instead of the global CONFIG_BTRFS_DEBUG to print debug messages about failed block-group writes. Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages when running a debug kernel. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:43 +01:00
Johannes Thumshirn	7a195f6db9	btrfs: make the uptodate argument of io_ctl_add_pages() boolean Make the uptodate argument of io_ctl_add_pages() boolean. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:43 +01:00
Johannes Thumshirn	831fa14f1e	btrfs: use inode from io_ctl in io_ctl_prepare_pages io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct inode', but btrfs_io_ctl::inode points to the same struct inode as this is assigned in io_ctl_init(). Use the inode from io_ctl to reduce the arguments of io_ctl_prepare_pages. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:43 +01:00
Marcos Paulo de Souza	949964c928	btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl This ioctl will be responsible for deleting a subvolume using its id. This can be used when a system has a file system mounted from a subvolume, rather than the root file system, like below: / @subvol1/ @subvol2/ @subvol_default/ If only @subvol_default is mounted, we have no path to reach @subvol1 and @subvol2, thus no way to delete them. Current subvolume delete ioctl takes a file handle point as argument, and if @subvol_default is mounted, we can't reach @subvol1 and @subvol2 from the same mount point. This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes the extended structure with flags to allow to delete subvolume using subvolid. Now, we can use this new ioctl specifying the subvolume id and refer to the same mount point. It doesn't matter which subvolume was mounted, since we can reach to the desired one using the subvolume id, and then delete it. The full path to the subvolume id is resolved internally and access is verified as if the subvolume was accessed by path. The volume args v2 structure is extended to use the existing union for subvolume id specification, that's valid in case the BTRFS_SUBVOL_SPEC_BY_ID is set. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:42 +01:00
Marcos Paulo de Souza	c0c907a47d	btrfs: export helpers for subvolume name/id resolution The functions will be used outside of export.c and super.c to allow resolving subvolume name from a given id, eg. for subvolume deletion by id ioctl. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ split from the next patch ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:42 +01:00
David Sterba	748449cdbe	btrfs: use ioctl args support mask for device delete When the device remove v2 ioctl was added, the full support mask was added to sanity check the flags. However this would allow the subvolume related flags to be accepted. This is not supposed to happen. Use the correct support mask, which means that now any of BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is a user-visible change, specifying subvolume flags for device deletion does not make sense and there are hopefully no applications doing that. Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:42 +01:00
David Sterba	673990dba3	btrfs: use ioctl args support mask for subvolume create/delete Using the defined mask instead of flag enumeration in the ioctl handler is preferred. No functional changes. Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:42 +01:00
Jules Irenge	5ce48d0f0e	btrfs: Add missing lock annotation for release_extent_buffer() Sparse reports a warning at release_extent_buffer() warning: context imbalance in release_extent_buffer() - unexpected unlock The root cause is the missing annotation at release_extent_buffer() Add the missing __releases(&eb->refs_lock) annotation Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:42 +01:00
Josef Bacik	75ec1db871	btrfs: set update the uuid generation as soon as possible In my EIO stress testing I noticed I was getting forced to rescan the uuid tree pretty often, which was weird. This is because my error injection stuff would sometimes inject an error after log replay but before we loaded the UUID tree. If log replay committed the transaction it wouldn't have updated the uuid tree generation, but the tree was valid and didn't change, so there's no reason to not update the generation here. Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately after reading all the fs roots if the uuid tree generation matches the fs generation. Then any transaction commits that happen during mount won't screw up our uuid tree state, forcing us to do needless uuid rescans. Fixes: `70f8017547` ("Btrfs: check UUID tree during mount if required") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:41 +01:00
Josef Bacik	c94bec2c61	btrfs: bail out of uuid tree scanning if we're closing In doing my fsstress+EIO stress testing I started running into issues where umount would get stuck forever because the uuid checker was chewing through the thousands of subvolumes I had created. We shouldn't block umount on this, simply bail if we're unmounting the fs. We need to make sure we don't mark the UUID tree as ok, so we only set that bit if we made it through the whole rescan operation, but otherwise this is completely safe. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:41 +01:00
Nikolay Borisov	97f4dd09da	btrfs: make btrfs_check_uuid_tree private to disk-io.c It's used only during filesystem mount as such it can be made private to disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread as btrfs_check_uuid_tree is its sole caller. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:41 +01:00
Nikolay Borisov	560b7a4aa2	btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate btrfs_uuid_tree_iterate is called from only one place and its 2nd argument is always btrfs_check_uuid_tree_entry. Simplify btrfs_uuid_tree_iterate's signature by removing its 2nd argument and directly calling btrfs_check_uuid_tree_entry. Also move the latter into uuid-tree.h. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:41 +01:00
David Sterba	c17af96554	btrfs: raid56: simplify tracking of Q stripe presence There are temporary variables tracking the index of P and Q stripes, but none of them is really used as such, merely for determining if the Q stripe is present. This leads to compiler warnings with -Wunused-but-set-variable and has been reported several times. fs/btrfs/raid56.c: In function ‘finish_rmw’: fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable] 1199 | int p_stripe = -1; | ^~~~~~~~ fs/btrfs/raid56.c: In function ‘finish_parity_scrub’: fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable] 2356 | int p_stripe = -1; | ^~~~~~~~ Replace the two variables with one that has a clear meaning and also get rid of the warnings. The logic that verifies that there are only 2 valid cases is unchanged. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:41 +01:00
ethanwu	b25b0b871f	btrfs: backref, use correct count to resolve normal data refs With the following patches: - btrfs: backref, only collect file extent items matching backref offset - btrfs: backref, not adding refs from shared block when resolving normal backref - btrfs: backref, only search backref entries from leaves of the same root we only collect the normal data refs we want, so the imprecise upper bound total_refs of that EXTENT_ITEM could now be changed to the count of the normal backref entry we want to search. Background and how the patches fit together: Btrfs has two types of data backref. For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the exact block number. Therefore, we need to call resolve_indirect_refs. It uses btrfs_search_slot to locate the leaf block. Then we need to walk through the leaves to search for the EXTENT_DATA items that have disk bytenr matching the extent item (add_all_parents). When resolving indirect refs, we could take entries that don't belong to the backref entry we are searching for right now. For that reason when searching backref entry, we always use total refs of that EXTENT_ITEM rather than individual count. For example: item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize extent refs 24 gen 7302 flags DATA shared data backref parent 394985472 count 10 #1 extent data backref root 257 objectid 260 offset 1048576 count 3 #2 extent data backref root 256 objectid 260 offset 65536 count 6 #3 extent data backref root 257 objectid 260 offset 65536 count 5 #4 For example, when searching backref entry #4, we'll use total_refs 24, a very loose loop ending condition, instead of total_refs = 5. But using total_refs = 24 is not accurate. Sometimes, we'll never find all the refs from specific root. As a result, the loop keeps on going until we reach the end of that inode. The first 3 patches, handle 3 different types refs we might encounter. 
These refs do not belong to the normal backref we are searching, and hence need to be skipped. This patch changes the total_refs to correct number so that we could end loop as soon as we find all the refs we want. btrfs send uses backref to find possible clone sources, the following is a simple test to compare the results with and without this patch: $ btrfs subvolume create /sub1 $ for i in `seq 1 163840`; do dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct done $ btrfs subvolume snapshot /sub1 /sub2 $ for i in `seq 1 163840`; do dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct done $ btrfs subvolume snapshot -r /sub1 /snap1 $ time btrfs send /snap1 | btrfs receive /volume2 Without this patch: real 69m48.124s user 0m50.199s sys 70m15.600s With this patch: real 1m59.683s user 0m35.421s sys 2m42.684s Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: ethanwu <ethanwu@synology.com> [ add patchset cover letter with background and numbers ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
ethanwu	cfc0eed0ec	btrfs: backref, only search backref entries from leaves of the same root We could have some nodes/leaves in a subvolume whose owner is not that subvolume. In this case, when we resolve normal backrefs of that subvolume, we should avoid collecting references from these blocks. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
ethanwu	ed58f2e66e	btrfs: backref, don't add refs from shared block when resolving normal backref All references from the block of SHARED_DATA_REF belong to that shared block backref. For example: item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95 extent refs 24 gen 7302 flags DATA extent data backref root 257 objectid 260 offset 65536 count 5 extent data backref root 258 objectid 265 offset 0 count 9 shared data backref parent 394985472 count 10 Block 394985472 might be leaf from root 257, and the item objectid and (file_pos - file_extent_item::offset) in that leaf just happens to be 260 and 65536 which is equal to the first extent data backref entry. Before this patch, when we resolve backref: root 257 objectid 260 offset 65536 we will add those refs in block 394985472 and wrongly treat those as the refs we want. Fix this by checking if the leaf we are processing is shared data backref, if so, just skip this leaf. Shared data refs added into preftrees.direct have all entry value = 0 (root_id = 0, key = NULL, level = 0) except parent entry. Other refs from indirect tree will have key value and root id != 0, and these values won't be changed when their parent is resolved and added to preftrees.direct. Therefore, we could reuse the preftrees.direct and search ref with all values = 0 except parent is set to avoid getting those resolved refs block. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
ethanwu	7ac8b88ee6	btrfs: backref, only collect file extent items matching backref offset When resolving one backref of type EXTENT_DATA_REF, we collect all references that simply reference the EXTENT_ITEM even though their (file_pos - file_extent_item::offset) are not the same as the btrfs_extent_data_ref::offset we are searching for. This patch adds additional check so that we only collect references whose (file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
Johannes Thumshirn	9da2b242e2	btrfs: remove buffer_heads from super block mirror integrity checking The integrity checking code for the super block mirrors is the last remaining user of buffer_heads, change it to using plain bios as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
Johannes Thumshirn	59aaad503f	btrfs: remove buffer_heads from btrfsic_process_written_block() Now that the last caller of btrfsic_process_written_block() with buffer_heads is gone, remove the buffer_head processing path from it as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:40 +01:00
Johannes Thumshirn	61ecc5fc18	btrfs: remove btrfsic_submit_bh() Now that the last use of btrfsic_submit_bh() is gone as the super block is now written using bios, remove the function as well. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Johannes Thumshirn	314b6dd0ee	btrfs: use bios instead of buffer_heads from super block writeout Similar to the superblock read path, change the write path to using bios and pages instead of buffer_heads. This allows us to skip over the buffer_head code, for writing the superblock to disk. This is based on a patch originally authored by Nikolay Borisov. Co-developed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Johannes Thumshirn	8f32380d3f	btrfs: use the page cache for super block reading Super-block reading in BTRFS is done using buffer_heads. Buffer_heads have some drawbacks, like not being able to propagate errors from the lower layers. Directly use the page cache for reading the super blocks from disk or invalidating an on-disk super block. We have to use the page cache so to avoid races between mkfs and udev. See also `6f60cbd3ae` ("btrfs: access superblock via pagecache in scan_one_device"). This patch unwraps the buffer head API and does not change the way the super block is actually read. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Johannes Thumshirn	6fbceb9fa4	btrfs: reduce scope of btrfs_scratch_superblocks() btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so remove it from the header file and mark it as static. Also move it above its callers so we don't need a forward declaration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Johannes Thumshirn	c514c9b10b	btrfs: don't kmap() pages from block devices Block device mappings are never in highmem so kmap() / kunmap() calls for pages from block devices are unneeded. Use page_address() instead of kmap() to get to the virtual addresses. While we're at it, read_cache_page_gfp() doesn't return NULL on error, only an ERR_PTR, so use IS_ERR() to check for errors. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:39 +01:00
Nikolay Borisov	f6d9abbc1f	btrfs: Export btrfs_release_disk_super Preparatory patch for removal of buffer_head usage in btrfs. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Filipe Manana	55ffaabe23	Btrfs: avoid unnecessary splits when setting bits on an extent io tree When attempting to set bits on a range of an extent io tree that already has those bits set we can end up splitting an extent state record, use the preallocated extent state record, insert it into the red black tree, do another search on the red black tree, merge the preallocated extent state record with the previous extent state record, remove that previous record from the red black tree and then free it. This is all unnecessary work that consumes time. This happens specifically at the following case at __set_extent_bit(): $ cat -n fs/btrfs/extent_io.c 957 static int __must_check 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, (...) 1044 /* 1045 * | ---- desired range ---- | 1046 * | state | 1047 * or 1048 * | ------------- state -------------- | 1049 * (...) 1060 if (state->start < start) { 1061 if (state->state & exclusive_bits) { 1062 *failed_start = start; 1063 err = -EEXIST; 1064 goto out; 1065 } 1066 1067 prealloc = alloc_extent_state_atomic(prealloc); 1068 BUG_ON(!prealloc); 1069 err = split_state(tree, state, prealloc, start); 1070 if (err) 1071 extent_io_tree_panic(tree, err); 1072 1073 prealloc = NULL; So if our extent state represents a range from 0 to 1MiB for example, and we want to set bits in the range 128KiB to 256KiB for example, and that extent state record already has all those bits set, we end up splitting that record, so we end up with extent state records in the tree which represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is temporary because a subsequent iteration in that function will end up merging the records. The splitting requires using the preallocated extent state record, so a future iteration that needs to do another split will need to allocate another extent state record in an atomic context, something not ideal that we try to avoid as much as possible. 
The splitting also requires an insertion in the red black tree, and a subsequent merge will require a deletion from the red black tree and freeing an extent state record. This change just skips the splitting of an extent state record when it already has all the bits that we need to set. Setting a bit that is already set for a range is very common in the inode's 'file_extent_tree' extent io tree for example, where we keep setting the EXTENT_DIRTY bit every time we replace an extent. This change also fixes a bug that happens after the recent patchset from Josef that avoids having implicit holes after a power failure when not using the NO_HOLES feature, more specifically the patch with the subject: "btrfs: introduce the inode->file_extent_tree" This patch introduced an extent io tree per inode to keep track of completed ordered extents and figure out at any time what is the safe value for the inode's disk_i_size. This assumes that for contiguous ranges in a file we always end up with a single extent state record in the io tree, but that is not the case, as there is a short time window where we can have two extent state records representing contiguous ranges. When this happens we end up setting an incorrect value for the inode's disk_i_size, resulting in data loss after a clean unmount of the filesystem. The following example explains how this can happen. Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in the inode's file_extent_tree we have a single extent state record that represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the following steps happen: 1) A buffered write against file range [512KiB, 768KiB) is made. At this point delalloc was not flushed yet; 2) Deduplication from some other inode into this inode's range [128KiB, 256KiB) is made. 
This causes btrfs_inode_set_file_extent_range() to be called, from btrfs_insert_clone_extent(), to mark the range [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree; 3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we end up at __set_extent_bit(). In the first iteration of that function's loop we end up in the following branch: $ cat -n fs/btrfs/extent_io.c 957 static int __must_check 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, (...) 1044 /* 1045 * | ---- desired range ---- | 1046 * | state | 1047 * or 1048 * | ------------- state -------------- | 1049 * (...) 1060 if (state->start < start) { 1061 if (state->state & exclusive_bits) { 1062 *failed_start = start; 1063 err = -EEXIST; 1064 goto out; 1065 } 1066 1067 prealloc = alloc_extent_state_atomic(prealloc); 1068 BUG_ON(!prealloc); 1069 err = split_state(tree, state, prealloc, start); 1070 if (err) 1071 extent_io_tree_panic(tree, err); 1072 1073 prealloc = NULL; (...) 1089 goto search_again; This splits the state record into two, one for range [0, 128KiB) and another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY bit set. 
Then we jump to the 'search_again' label, where we unlock the spinlock protecting the extent io tree before jumping to the 'again' label to perform the next iteration; 4) In the meantime, delalloc is flushed, the ordered extent for the range [512KiB, 768KiB) is created and when it completes, at btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write() with a value of 0 for its 'new_size' argument; 5) Before the deduplication task currently at __set_extent_bit() moves to the next iteration, the task finishing the ordered extent calls find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write() and gets 'start' set to 0 and 'end' set to 128KiB - because at this moment the io tree has two extent state records, one representing the range [0, 128KiB) and another representing the range [128KiB, 1MiB), both with EXTENT_DIRTY set. Then we set 'isize' to: isize = min(isize, end + 1) = min(1MiB, 128KiB - 1 + 1) = 128KiB Then we set the inode's disk_i_size to 128KiB (isize). After a clean unmount of the filesystem and mounting it again, we have the file with a size of 128KiB, and effectively lost all the data it had before in the range from 128KiB to 1MiB. This change fixes that issue too, as we never end up splitting extent state records when they already have all the bits we want set. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Josef Bacik	ab9b2c7b32	btrfs: handle logged extent failure properly If we're allocating a logged extent we attempt to insert an extent record for the file extent directly. We increase space_info->bytes_reserved, because the extent entry addition will call btrfs_update_block_group(), which will convert the ->bytes_reserved to ->bytes_used. However if we fail at any point while inserting the extent entry we will bail and leave space on ->bytes_reserved, which will trigger a WARN_ON() on umount. Fix this by pinning the space if we fail to insert, which is what happens in every other failure case that involves adding the extent entry. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Qu Wenruo	e19221180d	btrfs: relocation: Remove is_cowonly_root() This function is only used in read_fs_root(), which is just a wrapper of btrfs_get_fs_root(). For all the mentioned essential roots except log root tree, btrfs_get_fs_root() has its own quick path to grab them from fs_info directly, thus no need for key.offset modification. For subvolume trees, btrfs_get_fs_root() with key.offset == -1 is completely fine. For log trees and log root tree, it's impossible to hit them, as for relocation all backrefs are fetched from commit root, which never records log tree blocks. Log tree blocks either get freed in regular transaction commit, or replayed at mount time. At runtime we should never hit a backref for log tree in extent tree. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Nikolay Borisov	fe119a6eeb	btrfs: switch to per-transaction pinned extents This commit flips the switch to start tracking/processing pinned extents on a per-transaction basis. It mostly replaces all references from btrfs_fs_info::(pinned_extents\|freed_extents[]) to btrfs_transaction::pinned_extents. Two notable modifications that warrant explicit mention are changing clean_pinned_extents to get a reference to the previously running transaction. The other one is removal of call to btrfs_destroy_pinned_extent since transactions are going to be cleaned in btrfs_cleanup_one_transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:38 +01:00
Nikolay Borisov	45bb5d6ae9	btrfs: Factor out pinned extent clean up in btrfs_delete_unused_bgs Next patch is going to refactor how pinned extents are tracked which will necessitate changing this code. To ease that work and contain the changes factor the code now in preparation, this will also help review. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	f2fb72983b	btrfs: Mark pinned log extents as excluded In preparation to making pinned extents per-transaction ensure that such log extents are always excluded from caching. To achieve this in addition to marking them via btrfs_pin_extent_for_log_replay they also need to be marked with btrfs_add_excluded_extent to prevent log tree extent buffer being loaded by the free space caching thread. That's required since log tree blocks are not recorded in the extent tree, hence they always look free. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	6b45f64172	btrfs: Pass transaction handle to write_pinned_extent_entries Preparation for refactoring pinned extents tracking. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	6690d07126	btrfs: Make pin_down_extent take transaction handle All callers have a reference to a transaction handle so pass it to pin_down_extent. This is the final step before switching pinned extent tracking to a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	9fce570454	btrfs: Make btrfs_pin_extent_for_log_replay take transaction handle Preparation for refactoring pinned extents tracking. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	7bfc100705	btrfs: Make btrfs_pin_reserved_extent take transaction handle btrfs_pin_reserved_extent is now only called with a valid transaction so exploit the fact to take a transaction. This is preparation for tracking pinned extents on a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:37 +01:00
Nikolay Borisov	10e958d523	btrfs: Call btrfs_pin_reserved_extent only during active transaction Calling btrfs_pin_reserved_extent makes sense only with a valid transaction since pinned extents are processed from transaction commit in btrfs_finish_extent_commit. In case of error it's sufficient to adjust the reserved counter to account for log tree extents allocated in the last transaction. This commit moves btrfs_pin_reserved_extent to be called only with valid transaction handle and otherwise uses the newly introduced unaccount_log_buffer to adjust "reserved". If this is not done and a failure occurs before the transaction is committed, WARN_ONs are going to be triggered on unmount. This was especially pronounced with generic/475 test. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:36 +01:00
Nikolay Borisov	6787bb9f35	btrfs: Introduce unaccount_log_buffer This function correctly adjusts the reserved bytes occupied by a log tree extent buffer. It will be used instead of calling btrfs_pin_reserved_extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:36 +01:00
Nikolay Borisov	b25c36f84b	btrfs: Make btrfs_pin_extent take trans handle Preparation for switching pinned extent tracking to a per-transaction basis. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:36 +01:00
Nikolay Borisov	f603bb94ab	btrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic for making pinned extents tracking per-transaction since btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context. Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs are going to be handled very closely, in btrfs_destroy_pinned_extent. To enable btrfs_pin_extent to take btrfs_trans_handle simply open code it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range on the range. This enables us to do less work in btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in contexts which have a valid btrfs_trans_handle. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:36 +01:00
Anand Jain	25864778bc	btrfs: sysfs, unify handler name of devinfo/missing The devinfo attribute handlers were added in `668e48af7a` ("btrfs: sysfs, add devid/dev_state kobject and device attributes") and the name should contain _devinfo_, there's one that does not conform, so unify it with the rest. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:36 +01:00
Anand Jain	f3cd2c5811	btrfs: sysfs, rename device_link add/remove functions Since commit `668e48af7a` ("btrfs: sysfs, add devid/dev_state kobject and device attributes"), the functions btrfs_sysfs_add_device_link() and btrfs_sysfs_rm_device_link() do more than just adding and removing the device link as their names indicate. Rename them to be more specific that they are about the directory with the attributes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
Anand Jain	1f6087e69c	btrfs: sysfs, use btrfs_sysfs_remove_fsid to cleanup errors in add_fsid We have one simple function btrfs_sysfs_remove_fsid() to undo btrfs_sysfs_add_fsid(), which also does proper checks before releasing objects. One difference: if btrfs_sysfs_remove_fsid is used, we now also call kobject_del() which was missing before. This was tested (with kobject debug turned on) and no change in behaviour was found. This is a cleanup patch. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
David Sterba	f657a31c86	btrfs: sink argument tree to __do_readpage The tree pointer can be safely read from the inode, use it and drop the redundant argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
David Sterba	b6660e80f1	btrfs: sink argument tree to contiguous_readpages The tree pointer can be safely read from the inode, use it and drop the redundant argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
David Sterba	0d44fea77e	btrfs: sink argument tree to __extent_read_full_page The tree pointer can be safely read from the inode, use it and drop the redundant argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
David Sterba	71ad38b44e	btrfs: sink argument tree to extent_read_full_page The tree pointer can be safely read from the page's inode, use it and drop the redundant argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:35 +01:00
David Sterba	b272ae22ac	btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range The tree pointer can be safely read from the inode so we can drop the redundant argument from btrfs_lock_and_flush_ordered_range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:34 +01:00
David Sterba	ae6957ebbf	btrfs: add assertions for tree == inode->io_tree to extent IO helpers Add assertions to all helpers that get tree as argument and verify that it's the same that can be obtained from the inode or from its pages. In followup patches the redundant arguments and assertions will be removed one by one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:34 +01:00
David Sterba	0ceb34bf46	btrfs: drop argument tree from submit_extent_page Now that we're sure the tree from argument is same as the one we can get from the page's inode io_tree, drop the redundant argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:34 +01:00
David Sterba	45b08405b9	btrfs: remove extent_page_data::tree All functions that set up extent_page_data::tree set it to the inode io_tree. That's passed down the callstack that accesses either the same inode or its pages. In the end submit_extent_page can pull the tree out of the page and we don't have to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:34 +01:00
David Sterba	bf31f87f71	btrfs: add wrapper for transaction abort predicate The status of aborted transaction can change between calls and it needs to be accessed by READ_ONCE. Add a helper that also wraps the unlikely hint. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:34 +01:00
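A userspace sketch of the kind of wrapper this entry describes, under stated assumptions: `DEMO_READ_ONCE`, `demo_unlikely`, `struct demo_transaction` and `demo_trans_aborted` are hypothetical names that only imitate the kernel's `READ_ONCE` and `unlikely`, and the macros rely on the GCC/Clang extensions `__typeof__` and `__builtin_expect`. It shows a single volatile read of the flag combined with the branch hint, not the helper the commit actually adds.

```c
#include <assert.h>

/* Userspace imitation of a read-once access: force one volatile load. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))
/* Userspace imitation of the branch hint; GCC/Clang builtin. */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, reduced transaction: non-zero 'aborted' holds an errno. */
struct demo_transaction {
	int aborted;
};

/* Predicate that hides both the read-once access and the unlikely hint. */
static inline int demo_trans_aborted(struct demo_transaction *trans)
{
	return demo_unlikely(DEMO_READ_ONCE(trans->aborted));
}
```

Callers then write `if (demo_trans_aborted(trans))` instead of repeating the read and the hint at every site.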
David Sterba	b908c334e7	btrfs: move root node locking helpers to locking.c The helpers are related to locking so move them there, update comments. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	0024652895	btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root We are now using these for all roots, rename them to btrfs_put_root() and btrfs_grab_root(); Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	bd647ce385	btrfs: add a leak check for roots Now that we're going to start relying on getting ref counting right for roots, add a list to track allocated roots and print out any roots that aren't freed up at free_fs_info time. Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for developers to verify they aren't breaking things. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	8260edba67	btrfs: make the init of static elements in fs_info separate In adding things like eb leak checking and root leak checking there were a lot of weird corner cases that come from the fact that 1) We do not init the fs_info until we get to open_ctree time in the normal case and 2) The test infrastructure half-init's the fs_info for things that it needs. This makes it really annoying to make changes because you have to add init in two different places, have special cases for testing fs_info's that may not have certain things initialized, and cases for fs_info's that didn't make it to open_ctree and thus are not fully set up. Fix this by extracting out the non-allocating init of the fs info into its own public function and use that to make sure we're all getting consistent views of an allocated fs_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	ae18c37ad5	btrfs: move fs_info init work into its own helper function open_ctree mixes initialization of fs stuff and fs_info stuff, which makes it confusing when doing things like adding the root leak detection. Make a separate function that inits all the static structures inside of the fs_info needed for the fs to operate, and then call that before we start setting up the fs_info to be mounted. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:33 +01:00
Josef Bacik	141386e1a5	btrfs: free more things in btrfs_free_fs_info Things like the percpu_counters, the mapping_tree, and the csum hash can all be freed at btrfs_free_fs_info time, since the helpers all check if the structure has been initialized already. This significantly cleans up the error cases in open_ctree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	bc44d7c4b2	btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root Now that all callers of btrfs_get_fs_root are subsequently calling btrfs_grab_fs_root and handling dropping the ref when they are done appropriately, go ahead and push btrfs_grab_fs_root up into btrfs_get_fs_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	81f096edf0	btrfs: use btrfs_put_fs_root to free roots always If we are going to track leaked roots we need to free them all the same way, so don't kfree() roots directly, use btrfs_put_fs_root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	4c78e9f596	btrfs: hold a ref on the root in open_ctree We lookup the fs_root and put it in our fs_info directly, we should hold a ref on this root for the lifetime of the fs_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	0d4b046301	btrfs: export and rename free_fs_info We're going to start freeing roots and doing other complicated things in free_fs_info, so we need to move it to disk-io.c and export it in order to use things like btrfs_put_fs_root(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:32 +01:00
Josef Bacik	fbb0ce40d6	btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry We lookup the uuid of arbitrary subvolumes, hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	ca2037fba6	btrfs: hold a ref on the root in btrfs_recover_log_trees We replay the log into arbitrary fs roots, hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	5119cfc36f	btrfs: hold a ref on the root in create_pending_snapshot We create the snapshot and then use it for a bunch of things, we need to hold a ref on it while we're messing with it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	5168489a07	btrfs: hold a ref on the root in get_subvol_name_from_objectid We lookup the name of a subvol which means we'll cross into different roots. Hold a ref while we're doing the look ups in the fs_root we're searching. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	6f9a3da5da	btrfs: hold a ref on the root in btrfs_ioctl_send We lookup all the clone roots and the parent root for send, so we need to hold refs on all of these roots while we're processing them. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:31 +01:00
Josef Bacik	fd79d43b34	btrfs: hold a ref on the root in scrub_print_warning_inode We look up the root for the bytenr that is failing, so we need to hold a ref on the root for that operation. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	0b2dee5cff	btrfs: hold a ref for the root in btrfs_find_orphan_roots We lookup roots for every orphan item we have, we need to hold a ref on the root while we're doing this work. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	9f583209f2	btrfs: push grab_fs_root into read_fs_root All of relocation uses read_fs_root to lookup fs roots, so push the btrfs_grab_fs_root() up into that helper and remove the individual calls. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	932fd26df8	btrfs: hold a ref on the root in btrfs_recover_relocation We look up the fs root in various places in here when recovering from a crashed relocation. Make sure we hold a ref on the root whenever we look them up. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	76deacf023	btrfs: hold a ref on the root in create_reloc_inode We're creating a reloc inode in the data reloc tree, we need to hold a ref on the root while we're doing that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	3d7babdcf2	btrfs: hold a ref on the root in find_data_references We're looking up the data references for the bytenr in a root, we need to hold a ref on that root while we're doing that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:30 +01:00
Josef Bacik	442b1ac524	btrfs: hold a ref on the root in record_reloc_root_in_trans We are recording this root in the transaction, so we need to hold a ref on it until we do that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:29 +01:00
Josef Bacik	ab9737bd75	btrfs: hold a ref on the root in merge_reloc_roots We look up the corresponding root for the reloc root, we need to hold a ref while we're messing with it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:29 +01:00
Josef Bacik	db2c2ca2db	btrfs: hold a ref on the root in prepare_to_merge We look up the reloc root's corresponding root, we need to hold a ref on that root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:29 +01:00
Josef Bacik	0b530bc5e1	btrfs: hold a ref on the root in build_backref_tree This is trickier than the previous conversions. We have backref_node's that need to hold onto their root for their lifetime. Do the read of the root and grab the ref. If at any point we don't use the root we discard it, however if we use it in our backref node we don't free it until we free the backref node. Any time we switch the root's for the backref node we need to drop our ref on the old root and grab the ref on the new root, and if we dupe a node we need to get a ref on the root there as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:29 +01:00
Josef Bacik	2a2b5d6202	btrfs: hold ref on root in btrfs_ioctl_default_subvol We look up an arbitrary fs root here, we need to hold a ref on the root for the duration. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:29 +01:00
Josef Bacik	04734e8448	btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info We look up whatever root userspace has given us, we need to hold a ref throughout this operation. Use 'root' only for the on fs root and not as a temporary variable elsewhere. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	b8a49ae191	btrfs: hold a ref on the root in btrfs_search_path_in_tree_user We can wander into a different root, so grab a ref on the root we look up. Later on we make root = fs_info->tree_root so we need this separate out label to make sure we do the right cleanup only in the case we're looking up a different root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	88234012be	btrfs: hold a ref on the root in btrfs_search_path_in_tree We look up an arbitrary fs root, we need to hold a ref on it while we're doing our search. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	3ca35e839e	btrfs: hold a ref on the root in search_ioctl We lookup an arbitrary fs root, we need to hold a ref on that root. If we're using our own inode's root then grab a ref on that as well to make the cleanup easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	fc92f79856	btrfs: hold a ref on the root in create_subvol We're creating the new root here, but we should hold the ref until after we've initialized the inode for it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	8727002f79	btrfs: hold a ref on the root in fixup_tree_root_location Looking up the inode from an arbitrary tree means we need to hold a ref on that root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:28 +01:00
Josef Bacik	02162a0265	btrfs: hold a ref on the root in __btrfs_run_defrag_inode We are looking up an arbitrary inode, we need to hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	bdf70b9e75	btrfs: hold a root ref in btrfs_get_dentry Looking up the inode we need to search the root, make sure we hold a reference on that root while we're doing the lookup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	9326f76f4b	btrfs: hold a ref on the root in resolve_indirect_ref We're looking up a random root, we need to hold a ref on it while we're using it. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	af01d2e53f	btrfs: hold a ref on fs roots while they're in the radix tree If the root is sitting in the radix tree, we should probably have a ref for the radix tree. Grab a ref on the root when we insert it, and drop it when it gets deleted. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	4b8b052888	btrfs: describe the space reservation system in general Add another comment to cover how the space reservation system works generally. This covers the actual reservation flow, as well as how flushing is handled. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	6f4ad559ea	btrfs: add a comment describing delalloc space reservation delalloc space reservation is tricky because it encompasses both data and metadata. Make it clear what each side does, the general flow of how space is moved throughout the lifetime of a write, and what goes into the calculations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:27 +01:00
Josef Bacik	734d8c15df	btrfs: add a comment describing block reserves This is a giant comment at the top of block-rsv.c describing generally how block reserves work. It is purely about the block reserves themselves, and nothing to do with how the actual reservation system works. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
Josef Bacik	4cdfd93002	btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root We want to use this for dropping all roots, and in some error cases we may not have a root, so handle this to make the cleanup code easier. Make btrfs_grab_fs_root the same so we can use it in cases where the root may not exist (like the quota root). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
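The NULL handling this entry describes can be sketched in plain C. Everything here is hypothetical and simplified: `struct demo_root`, `demo_alloc_root`, `demo_grab_root` and `demo_put_root` are illustration-only names, and the counter is a plain `int` with no atomicity, unlike a real kernel reference count. The point is only that both helpers accept a NULL root, so cleanup paths need no extra checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced root object carrying only a reference count. */
struct demo_root {
	int refs;
};

/* Allocate a root that starts with one reference held by the caller. */
static struct demo_root *demo_alloc_root(void)
{
	struct demo_root *root = calloc(1, sizeof(*root));

	if (root)
		root->refs = 1;
	return root;
}

/* Take a reference; a NULL root is passed through untouched. */
static struct demo_root *demo_grab_root(struct demo_root *root)
{
	if (!root)
		return NULL;
	root->refs++;
	return root;
}

/* Drop a reference; returns 1 if the root was freed, 0 otherwise. */
static int demo_put_root(struct demo_root *root)
{
	if (!root)
		return 0;
	if (--root->refs == 0) {
		free(root);
		return 1;
	}
	return 0;
}
```

An error path can then call `demo_put_root()` on every root pointer it might have set, including optional ones that were never looked up.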
Josef Bacik	a98db0f304	btrfs: make the fs root init functions static Now that the orphan cleanup stuff doesn't use this directly we can just make them static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
Josef Bacik	3619c94f07	btrfs: open code btrfs_read_fs_root_no_name All this does is call btrfs_get_fs_root() with check_ref == true. Just use btrfs_get_fs_root() so we don't have a bunch of different helpers that do the same thing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
Josef Bacik	83db2aadb3	btrfs: remove btrfs_read_fs_root, not used anymore All helpers should either be using btrfs_get_fs_root() or btrfs_read_tree_root(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:26 +01:00
Josef Bacik	3dbf1738a1	btrfs: make relocation use btrfs_read_tree_root() Relocation has its special roots, we don't want to save these in the root cache either, so swap it to use btrfs_read_tree_root(). However the reloc root does need REF_COWS set, so make sure we set it everywhere we use this helper, as it no longer does the REF_COWS setting. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:25 +01:00
Josef Bacik	62a2c73ebd	btrfs: export and use btrfs_read_tree_root for tree-log Tree-log uses btrfs_read_fs_root to load its log, but this just calls btrfs_read_tree_root. We don't save the log roots in our root cache, so just export this helper and use it in the logging code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:25 +01:00
Josef Bacik	e59d18b45d	btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root btrfs_find_orphan_roots has this weird thing where it looks up the root in cache to see if it is there before just reading the root. But the read it uses just reads the root, it doesn't do any of the init work, we do that by hand here. But this is unnecessary, all we really want is to see if the root still exists and add it to the dead roots list to be cleaned up, otherwise we delete the orphan item. Fix this by just using btrfs_get_fs_root directly with check_ref set to false so we get the orphan root items. Then we just handle in cache and out of cache roots the same, add them to the dead roots list and carry on. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:25 +01:00
Josef Bacik	f39e457156	btrfs: move fs root init stuff into btrfs_init_fs_root We have a helper for reading fs roots that just reads the fs root off the disk and then sets REF_COWS and init's the inheritable flags. Move this into btrfs_init_fs_root so we can later get rid of this helper and consolidate all of the fs root reading into one helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:25 +01:00
Josef Bacik	96dfcb46ff	btrfs: push __setup_root into btrfs_alloc_root There's no reason to not init the root at alloc time, and with later patches it actually causes problems if we error out mounting the fs before the tree_root is init'ed because we expect it to have a valid ref count. Fix this by pushing __setup_root into btrfs_alloc_root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:25 +01:00
Josef Bacik	3f1c64ce04	btrfs: delete the ordered isize update code Now that we have a safe way to update the isize, remove all of this code as it's no longer needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:24 +01:00
Josef Bacik	d923afe96d	btrfs: replace all uses of btrfs_ordered_update_i_size Now that we have a safe way to update the i_size, replace all uses of btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:24 +01:00
Josef Bacik	9ddc959e80	btrfs: use the file extent tree infrastructure We want to use this everywhere we modify the file extent items permanently. These include: 1) Inserting new file extents for writes and prealloc extents. 2) Truncating inode items. 3) btrfs_cont_expand(). 4) Insert inline extents. 5) Insert new extents from log replay. 6) Insert a new extent for clone, as it could be past i_size. 7) Hole punching. For hole punching in particular it might seem it's not necessary because anybody extending would use btrfs_cont_expand, however there is a corner case that can still give us trouble. Start with an empty file and fallocate KEEP_SIZE 1M-2M. We now have a 0 length file, and a hole file extent from 0-1M, and a prealloc extent from 1M-2M. Now punch 1M-1.5M. Because this is past i_size we have [HOLE EXTENT][ NOTHING ][PREALLOC] [0 1M][1M 1.5M][1.5M 2M] with an i_size of 0. Now if we pwrite 0-1.5M we'll increase our i_size to 1.5M, but our disk_i_size is still 0 until the ordered extent completes. However if we now immediately truncate 2M on the file we'll just call btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we commit the transaction here and crash we'll expose the gap. To fix this we need to clear the file extent mapping for the range that we punched but didn't insert a corresponding file extent for. This will mean the truncate will only get a disk_i_size set to 1M if we crash before the finish ordered io happens. I've written an xfstest to reproduce the problem and validate this fix. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:24 +01:00
Josef Bacik	41a2ee75aa	btrfs: introduce per-inode file extent tree In order to keep track of where we have file extents on disk, and thus where it is safe to adjust the i_size to, we need to have a tree in place to keep track of the contiguous areas we have file extents for. Add helpers to use this tree, as it's not required for NO_HOLES file systems. We will use this by setting DIRTY for areas we know we have file extent items set, and clearing it when we remove file extent items for truncation. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:24 +01:00
Josef Bacik	790a1d44f9	btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update We were using btrfs_i_size_write(), which unconditionally jacks up inode->disk_i_size. However since clone can operate on ranges we could have pending ordered extents for a range prior to the start of our clone operation and thus increase disk_i_size too far and have a hole with no file extent. Fix this by using the btrfs_ordered_update_i_size helper which will do the right thing in the face of pending ordered extents outside of our clone range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:24 +01:00
Su Yue	cfe953c824	btrfs: update the comment of btrfs_control_ioctl() Btrfsctl was removed in 2012, now the function btrfs_control_ioctl() is only used for devices ioctls. So update the comment. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:23 +01:00
Qu Wenruo	0c89138970	btrfs: relocation: Add introduction of how relocation works Relocation is one of the most complex parts of btrfs, and it's also the foundation for online resizing and profile conversion. For such a complex facility, we should at least have some introduction to it. This patch adds a basic introduction at a pretty high level, explaining: - What relocation does - How relocation is done Only mentioning how data reloc tree and reloc tree are involved in the operation. No details like the backref cache, or the data reloc tree contents. - Which functions to refer to. More detailed comments will be added for reloc tree creation, data reloc tree creation and backref cache. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:23 +01:00
Filipe Manana	42836cf4ba	Btrfs: don't iterate mod seq list when putting a tree mod seq Each new element added to the mod seq list is always appended to the list, and each one gets a sequence number coming from a counter which gets incremented every time a new element is added to the list (or a new node is added to the tree mod log rbtree). Therefore the element with the lowest sequence number is always the first element in the list. So just remove the list iteration at btrfs_put_tree_mod_seq() that computes the minimum sequence number in the list and replace it with a check for the first element's sequence number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:23 +01:00
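The argument in the mod seq commit above (append-only list plus a monotonically increasing counter means the head always holds the minimum) can be illustrated in user space. This is a minimal sketch, not the btrfs code: `struct seq_list`, `seq_list_append()`, `seq_list_remove()`, `min_by_scan()`, `min_by_head()` and `seq_list_demo()` are hypothetical names, and a fixed-size array stands in for the kernel's linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMS 8

/* Hypothetical stand-in for the mod seq list: append-only, ordered by insertion. */
struct seq_list {
	uint64_t seq[MAX_ELEMS];
	size_t len;
	uint64_t counter;	/* only ever incremented */
};

/* Append a new element; it receives the next counter value. */
static int seq_list_append(struct seq_list *l)
{
	if (l->len == MAX_ELEMS)
		return -1;
	l->seq[l->len++] = ++l->counter;
	return 0;
}

/* Remove the element at idx (caller guarantees idx < len), keeping order. */
static void seq_list_remove(struct seq_list *l, size_t idx)
{
	for (size_t i = idx; i + 1 < l->len; i++)
		l->seq[i] = l->seq[i + 1];
	l->len--;
}

/* Old approach: walk every element to find the minimum. */
static uint64_t min_by_scan(const struct seq_list *l)
{
	uint64_t min = UINT64_MAX;

	for (size_t i = 0; i < l->len; i++)
		if (l->seq[i] < min)
			min = l->seq[i];
	return min;
}

/* New approach: the first element is the minimum by construction. */
static uint64_t min_by_head(const struct seq_list *l)
{
	return l->len ? l->seq[0] : UINT64_MAX;
}

/* Returns the head minimum if both approaches agree, 0 otherwise. */
static uint64_t seq_list_demo(void)
{
	struct seq_list l = { .len = 0, .counter = 0 };

	for (int i = 0; i < 5; i++)
		seq_list_append(&l);	/* seq: 1 2 3 4 5 */
	seq_list_remove(&l, 0);		/* seq: 2 3 4 5 */
	seq_list_remove(&l, 1);		/* seq: 2 4 5 */
	seq_list_append(&l);		/* seq: 2 4 5 6 */

	return min_by_scan(&l) == min_by_head(&l) ? min_by_head(&l) : 0;
}
```

Removals anywhere in the list do not disturb the ordering, so the two functions agree as long as nothing is ever inserted other than at the tail.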
Qu Wenruo	30b3688e1f	btrfs: Add overview of device replace The overview of btrfs dev-replace. It mentions some corner cases caused by the write duplication and scrub based data copy. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ adjust wording ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-23 17:01:23 +01:00
Hillf Danton	a5318d3cdf	io-uring: drop 'free_pfile' in struct io_file_put Sync removal of file is only used in case of a GFP_KERNEL kmalloc failure at the cost of io_file_put::done and work flush, while a glitch like that can be handled at the call site without too much pain. That said, what is proposed is to drop sync removing of file, and the kink in neck as well. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-23 09:22:15 -06:00
Hillf Danton	4afdb733b1	io-uring: drop completion when removing file A case of task hung was reported by syzbot, INFO: task syz-executor975:9880 blocked for more than 143 seconds. Not tainted 5.6.0-rc6-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor975 D27576 9880 9878 0x80004000 Call Trace: schedule+0xd0/0x2a0 kernel/sched/core.c:4154 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871 do_wait_for_common kernel/sched/completion.c:83 [inline] __wait_for_common kernel/sched/completion.c:104 [inline] wait_for_common kernel/sched/completion.c:115 [inline] wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867 io_sqe_files_update fs/io_uring.c:5918 [inline] __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131 __do_sys_io_uring_register fs/io_uring.c:7202 [inline] __se_sys_io_uring_register fs/io_uring.c:7184 [inline] __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe and bisect pointed to `05f3fb3c53` ("io_uring: avoid ring quiesce for fixed file set unregister and update"). It is down to the order that we wait for work done before flushing it while nobody is likely going to wake us up. We can drop that completion on stack as flushing work itself is a sync operation we need and no more is left behind it. To that end, io_file_put::done is re-used for indicating if it can be freed in the workqueue worker context. Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Rename ->done to ->free_pfile Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-23 09:21:06 -06:00
Luis Henriques	c8d6ee0144	ceph: fix memory leak in ceph_cleanup_snapid_map() kmemleak reports the following memory leak: unreferenced object 0xffff88821feac8a0 (size 96): comm "kworker/1:0", pid 17, jiffies 4294896362 (age 20.512s) hex dump (first 32 bytes): a0 c8 ea 1f 82 88 ff ff 00 c9 ea 1f 82 88 ff ff ................ 00 00 00 00 00 00 00 00 00 01 00 00 00 00 ad de ................ backtrace: [<00000000b3ea77fb>] ceph_get_snapid_map+0x75/0x2a0 [<00000000d4060942>] fill_inode+0xb26/0x1010 [<0000000049da6206>] ceph_readdir_prepopulate+0x389/0xc40 [<00000000e2fe2549>] dispatch+0x11ab/0x1521 [<000000007700b894>] ceph_con_workfn+0xf3d/0x3240 [<0000000039138a41>] process_one_work+0x24d/0x590 [<00000000eb751f34>] worker_thread+0x4a/0x3d0 [<000000007e8f0d42>] kthread+0xfb/0x130 [<00000000d49bd1fa>] ret_from_fork+0x3a/0x50 A kfree is missing while looping the 'to_free' list of ceph_snapid_map objects. Cc: stable@vger.kernel.org Fixes: `75c9627efb` ("ceph: map snapid to anonymous bdev ID") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-03-23 13:07:08 +01:00
Ilya Dryomov	7614209736	ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult per-pool flags as well. Unfortunately the backwards compatibility here is lacking: - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but was guarded by require_osd_release >= RELEASE_LUMINOUS - it was subsequently backported to luminous in v12.2.2, but that makes no difference to clients that only check OSDMAP_FULL/NEARFULL because require_osd_release is not client-facing -- it is for OSDs Since all kernels are affected, the best we can do here is just start checking both map flags and pool flags and send that to stable. These checks are best effort, so take osdc->lock and look up pool flags just once. Remove the FIXME, since filesystem quotas are checked above and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set. Cc: stable@vger.kernel.org Reported-by: Yanhu Cao <gmayyyha@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Sage Weil <sage@redhat.com>	2020-03-23 13:07:08 +01:00
Yilu Lin	97adda8b3a	CIFS: Fix bug which the return value by asynchronous read is error This patch fixes a bug in collect_uncached_read_data() where rc is implicitly converted from a signed to an unsigned number when a CIFS asynchronous read fails, leaving ctx->rc with a wrong value. Example: Share a directory and create a file on the Windows OS. Mount the directory to the Linux OS using CIFS. On the CIFS client of the Linux OS, invoke the pread interface to deliver the read request. The read length plus the offset of the read request is greater than the maximum file size. In this case, the CIFS server on the Windows OS returns a failure message (for example, the return value of smb2.nt_status is STATUS_INVALID_PARAMETER). After receiving the response message, the CIFS client parses smb2.nt_status to STATUS_INVALID_PARAMETER and converts it to the Linux error code (rdata->result=-22). Then the CIFS client invokes the collect_uncached_read_data function to assign the value of rdata->result to rc, that is, rc=rdata->result=-22. The type of the ctx->total_len variable is unsigned integer, the type of the rc variable is integer, and the type of the ctx->rc variable is ssize_t. Therefore, during the ternary operation, the value of rc is automatically converted to an unsigned number. The final result is ctx->rc=4294967274. However, the expected result is ctx->rc=-22. Signed-off-by: Yilu Lin <linyilu@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-03-22 22:49:10 -05:00
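The conversion described in the CIFS asynchronous read commit above can be reproduced outside the kernel. The following is a minimal sketch under stated assumptions: `unsigned int` is 32 bits wide, `long long` stands in for a 64-bit `ssize_t`, and `pick_buggy()` / `pick_fixed()` are hypothetical names rather than the actual CIFS code.

```c
#include <assert.h>

/* Assumption of this sketch: 32-bit unsigned int, as on common Linux ABIs. */
static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/*
 * Buggy form: the two arms have types unsigned int and int, so the usual
 * arithmetic conversions make the whole conditional expression unsigned int.
 * A negative rc wraps around before it is widened to the 64-bit result.
 */
static long long pick_buggy(int rc, unsigned int total_len)
{
	return (rc == 0) ? total_len : rc;
}

/*
 * Fixed form: casting the unsigned arm to the signed 64-bit type makes the
 * conditional expression signed, so a negative rc survives unchanged.
 */
static long long pick_fixed(int rc, unsigned int total_len)
{
	return (rc == 0) ? (long long)total_len : rc;
}
```

With `rc = -22`, the buggy form yields 4294967274 (that is, 2^32 - 22), matching the value quoted in the commit message, while the fixed form yields -22.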
Murphy Zhou	ef4a632ccc	CIFS: check new file size when extending file by fallocate xfstests generic/228 checks if fallocate respects RLIMIT_FSIZE. After fallocate mode 0 extending was enabled, we can hit this failure. Fix this by checking the new file size with the vfs helper, returning an error if the file size is larger than RLIMIT_FSIZE (ulimit -f). This patch has been tested by LTP/xfstests against samba and Windows server. Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>	2020-03-22 22:49:10 -05:00
Steve French	8895c66f2b	SMB3: Minor cleanup of protocol definitions And add one missing define (COMPRESSION_TRANSFORM_ID) and flag (TRANSFORM_FLAG_ENCRYPTED) Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:10 -05:00
Steve French	8f23343131	SMB3: Additional compression structures New transform header structures. See recent updates to MS-SMB2 adding section 2.2.42.1 and 2.2.42.2 Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-03-22 22:49:10 -05:00
Steve French	2fe4f62de4	SMB3: Add new compression flags Additional compression capabilities can now be negotiated and a new compression algorithm. Add the flags for these. See newly updated MS-SMB2 sections 3.1.4.4.1 and 2.2.3.1.3 Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-03-22 22:49:10 -05:00
Gustavo A. R. Silva	cff2def598	cifs: smb2pdu.h: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:10 -05:00
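The `struct foo` example from the flexible-array commit above can be expanded into a small allocation sketch. It reuses the `struct foo` / `struct boo` names from the commit message; the field `value` and the helpers `foo_alloc()` and `foo_demo_sum()` are hypothetical additions for illustration and use user-space `malloc()` rather than kernel allocators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct boo {
	int value;
};

struct foo {
	int stuff;
	struct boo array[];	/* flexible array member: must be the last field */
};

/* Allocate a struct foo with room for n trailing elements. */
static struct foo *foo_alloc(size_t n)
{
	struct foo *f;

	/* Guard against overflow in the size computation. */
	if (n > (SIZE_MAX - sizeof(*f)) / sizeof(f->array[0]))
		return NULL;

	/* sizeof(*f) covers only the fixed part; the array is added explicitly. */
	f = malloc(sizeof(*f) + n * sizeof(f->array[0]));
	if (!f)
		return NULL;

	f->stuff = (int)n;
	for (size_t i = 0; i < n; i++)
		f->array[i].value = (int)i * 10;
	return f;
}

/* Allocate, sum the trailing elements, free; returns -1 on allocation failure. */
static long foo_demo_sum(size_t n)
{
	struct foo *f = foo_alloc(n);
	long sum = 0;

	if (!f)
		return -1;
	for (size_t i = 0; i < n; i++)
		sum += f->array[i].value;
	free(f);
	return sum;
}
```

Moving `array[]` above `stuff` in this struct is rejected by the compiler, which is the diagnostic the commit message relies on; a zero-length `array[0]` in the same position would be accepted silently.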
Eric Biggers	dc920277f1	cifs: clear PF_MEMALLOC before exiting demultiplex thread Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set during do_exit(). That can confuse things. For example, if BSD process accounting is enabled and the accounting file has FS_SYNC_FL set and is located on an ext4 filesystem without a journal, then do_exit() can end up calling ext4_write_inode(). That triggers the WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes (appropriately) that inodes aren't written when allocating memory. This was originally reported for another kernel thread, xfsaild() [1]. cifs_demultiplex_thread() also exits with PF_MEMALLOC set, so it's potentially subject to this same class of issue -- though I haven't been able to reproduce the WARN_ON_ONCE() via CIFS, since unlike xfsaild(), cifs_demultiplex_thread() is sent SIGKILL before exiting, and that interrupts the write to the BSD process accounting file. Either way, leaving PF_MEMALLOC set is potentially problematic. Let's clean this up by properly saving and restoring PF_MEMALLOC. [1] https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:10 -05:00
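The "save and restore" pattern mentioned in the PF_MEMALLOC commit above can be sketched with a plain flags word. This is an illustration only: `TASK_FLAG_MEMALLOC` and its bit value, `flags_save_and_set()`, `flags_restore()` and `run_worker()` are hypothetical names, and the kernel operates on `current->flags` through its own helpers rather than through code like this.

```c
#include <assert.h>

/* Hypothetical flag bit standing in for PF_MEMALLOC. */
#define TASK_FLAG_MEMALLOC 0x0800u

/* Set the flag and return its previous state so it can be restored later. */
static unsigned int flags_save_and_set(unsigned int *flags)
{
	unsigned int old = *flags & TASK_FLAG_MEMALLOC;

	*flags |= TASK_FLAG_MEMALLOC;
	return old;
}

/* Put the flag back to the state captured by flags_save_and_set(). */
static void flags_restore(unsigned int *flags, unsigned int old)
{
	*flags = (*flags & ~TASK_FLAG_MEMALLOC) | old;
}

/* Thread body: the flag is set only for the duration of the work. */
static unsigned int run_worker(unsigned int initial_flags)
{
	unsigned int flags = initial_flags;
	unsigned int saved = flags_save_and_set(&flags);

	/* ... work that needs the flag set would run here ... */

	flags_restore(&flags, saved);
	return flags;	/* what the exit path would observe */
}
```

Restoring the saved state, rather than unconditionally clearing the bit, keeps the pattern correct even if the flag was already set when the thread function was entered.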
Gustavo A. R. Silva	266b9fecc5	cifs: cifspdu.h: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:10 -05:00
Steve French	ba55344f36	CIFS: Warn less noisily on default mount The warning we print on mount about how to use less secure dialects (when the user does not specify a version on mount) is useful but is noisy to print on every default mount, and can be changed to a warn_once. Slightly updated the warning text as well to note SMB3.1.1 which has been the default which is typically negotiated (for a few years now) by most servers. "No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1) specify vers=1.0 on mount." Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-03-22 22:49:09 -05:00
Qiujun Huang	f2d67931fd	fs/cifs: fix gcc warning in sid_to_id fix warning [-Wunused-but-set-variable] at variable 'rc', keeping the code readable. Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Murphy Zhou	0667059d0b	cifs: allow unlock flock and OFD lock across fork Since commit `d0677992d2` ("cifs: add support for flock") added support for flock, LTP/flock03[1] testcase started to fail. This testcase is testing flock lock and unlock across fork. The parent locks the file and starts the child process, which unlocks the same fd and locks the same file again with another fd. All the lock and unlock operations should succeed. Now the child process does not actually unlock the file, so the following lock fails. Fix this by allowing flock and OFD locks to go through the unlock routine, not skipping it if the unlock request comes from another process. Patch has been tested by LTP/xfstests on samba and Windows server, v3.11, with or without cache=none mount option. [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/flock/flock03.c Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-03-22 22:49:09 -05:00
Steve French	c7e9f78f7b	cifs: do d_move in rename See commit `349457ccf2` "Allow file systems to manually d_move() inside of ->rename()" Lessens possibility of race conditions in rename Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Aurelien Aptel	69dda3059e	cifs: add SMB2_open() arg to return POSIX data allows SMB2_open() callers to pass down a POSIX data buffer that will trigger requesting POSIX create context and parsing the response into the provided buffer. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>	2020-03-22 22:49:09 -05:00
Aurelien Aptel	3d519bd126	cifs: plumb smb2 POSIX dir enumeration * add code to request POSIX info level * parse dir entries and fill cifs_fattr to get correct inode data since the POSIX payload is variable size the number of entries in a FIND response needs to be computed differently. Dirs and regular files are properly reported along with mode bits, hardlink number, c/m/atime. No special files yet (see below). Current experimental version of Samba with the extension unfortunately has issues with wildcards and needs the following patch: > --- i/source3/smbd/smb2_query_directory.c > +++ w/source3/smbd/smb2_query_directory.c > @@ -397,9 +397,7 @@ smbd_smb2_query_directory_send(TALLOC_CTX > *mem_ctx, > } > } > > - if (!state->smbreq->posix_pathnames) { > wcard_has_wild = ms_has_wild(state->in_file_name); > - } > > /* Ensure we've canonicalized any search path if not a wildcard. */ > if (!wcard_has_wild) { > Also for special files despite reporting them as reparse point samba doesn't set the reparse tag field. This patch will mark them as needing re-evaluation but the re-evaluate code doesn't deal with it yet. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Aurelien Aptel	349e13ad30	cifs: add smb2 POSIX info level * add new info level and structs for SMB2 posix extension * add functions to parse and validate it Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Aurelien Aptel	2e8af978d9	cifs: rename posix create rsp little progress on the posix create response. * rename struct to create_posix_rsp to match with the request create_posix context * make struct packed * pass smb info struct for parse_posix_ctxt to fill * use smb info struct as param * update TODO What needs to be done: SMB2_open() has an optional smb info out argument that it will fill. Callers making use of this are: - smb3_query_mf_symlink (need to investigate) - smb2_open_file Callers of smb2_open_file (via server->ops->open) are passing an smbinfo struct but that struct cannot hold POSIX information. All the call stack needs to be changed for a different info type. Maybe pass SMB generic struct like cifs_fattr instead. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Steve French	8fe0c2c2cb	cifs: print warning mounting with vers=1.0 We really, really don't want people using insecure dialects unless they realize what they are doing ... Add mount warning if mounting with vers=1.0 (older SMB1/CIFS dialect) instead of the default (SMB2.1 or later, typically SMB3.1.1). Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-03-22 22:49:09 -05:00
Steve French	cf5371ae46	smb3: fix performance regression with setting mtime There are cases when we don't want to send the SMB2 flush operation (e.g. when user specifies mount parm "nostrictsync") and it can be a very expensive operation on the server. In most cases in order to set mtime, we simply need to flush (write) the dirty pages from the client and send the writes to the server, not also send a flush protocol operation to the server. Fixes: `aa081859b1` ("cifs: flush before set-info if we have writeable handles") CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Stefan Metzmacher	864138cb31	cifs: make use of cap_unix(ses) in cifs_reconnect_tcon() cap_unix(ses) defaults to false for SMB2. Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Stefan Metzmacher	b08484d715	cifs: use mod_delayed_work() for &server->reconnect if already queued mod_delayed_work() is safer than queue_delayed_work() if there's a chance that the work is already in the queue. Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Stefan Metzmacher	e2e87519bd	cifs: call wake_up(&server->response_q) inside of cifs_reconnect() This means it's consistently called and the callers don't need to care about it. Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Paulo Alcantara (SUSE)	bacd704a95	cifs: handle prefix paths in reconnect For the case where we have a DFS path like below and we're currently connected to targetA: //dfsroot/link -> //targetA/share/foo, //targetB/share/bar after failover, we should make sure to update cifs_sb->prepath so the next operations will use the new prefix path "/bar". Besides, in order to simplify the use of different prefix paths, enforce CIFS_MOUNT_USE_PREFIX_PATH for DFS mounts so we don't have to revalidate the root dentry every time we set a new prefix path. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Steve French	ffdec8d642	cifs: do not ignore the SYNC flags in getattr Check the AT_STATX_FORCE_SYNC flag and force an attribute revalidation if requested by the caller, and if the caller specifies AT_STATX_DONT_SYNC only revalidate cached attributes if required. In addition do not flush writes in getattr (which can be expensive) if size or timestamps are not requested by the caller. Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-03-22 22:49:09 -05:00
Pavel Begunkov	18a542ff19	io_uring: Fix ->data corruption on re-enqueue work->data and work->list are shared in union. io_wq_assign_next() sets ->data if a req has a linked_timeout, but then io-wq may want to use work->list, e.g. to re-enqueue a request, thus corrupting ->data. ->data is not necessary, just remove it and extract linked_timeout through @link_list. Fixes: `60cf46ae60` ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-22 19:31:27 -06:00
Linus Torvalds	67d584e33e	for-5.6-rc6-tag Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two fixes. The first is a regression: when dropping some incompat bits the conditions were reversed. The other is a fix for rename whiteout potentially leaving stack memory linked to a list" * tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix removal of raid[56\|1c34} incompat flags after removing block group btrfs: fix log context list corruption after rename whiteout error	2020-03-22 11:35:33 -07:00
Linus Torvalds	b3c03db67e	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: x86/mm: split vmalloc_sync_all() mm, slub: prevent kmalloc_node crashes and memory leaks mm/mmu_notifier: silence PROVE_RCU_LIST warnings epoll: fix possible lost wakeup on epoll_ctl() path mm: do not allow MADV_PAGEOUT for CoW pages mm, memcg: throttle allocators based on ancestral memory.high mm, memcg: fix corruption on 64-bit divisor in memory.high throttling page-flags: fix a crash at SetPageError(THP_SWAP) mm/hotplug: fix hot remove failure in SPARSEMEM\|!VMEMMAP case memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event	2020-03-22 10:46:50 -07:00
Pavel Begunkov	f2cf11492b	io-wq: close cancel gap for hashed linked work After io_assign_current_work() of a linked work, it can be decided to offload it to another thread by doing io_wqe_enqueue(). However, until the next io_assign_current_work() it can be cancelled, which isn't handled. Don't assign it if it's not going to be executed. Fixes: `60cf46ae60` ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-22 11:33:58 -06:00
Roman Penyaev	1b53734bd0	epoll: fix possible lost wakeup on epoll_ctl() path This fixes possible lost wakeup introduced by commit `a218cc4914`. Originally modifications to ep->wq were serialized by ep->wq.lock, but in commit `a218cc4914` ("epoll: use rwlock in order to reduce ep_poll_callback() contention") a new rw lock was introduced in order to relax fd event path, i.e. callers of ep_poll_callback() function. After the change ep_modify and ep_insert (both are called on epoll_ctl() path) were switched to ep->lock, but ep_poll (epoll_wait) was using ep->wq.lock on wqueue list modification. The bug doesn't lead to any wqueue list corruptions, because wake up path and list modifications were serialized by ep->wq.lock internally, but the actual waitqueue_active() check prior to the wake_up() call can be reordered with modifications of the ep ready list, thus a wake up can be lost. And yes, this can be healed by an explicit smp_mb(): list_add_tail(&epi->rdlink, &ep->rdllist); smp_mb(); if (waitqueue_active(&ep->wq)) wake_up(&ep->wq); But let's keep it simple: the current patch replaces ep->wq.lock with ep->lock for wqueue modifications, so the wake up path always observes the activeness of the wqueue correctly.
Fixes: `a218cc4914` ("epoll: use rwlock in order to reduce ep_poll_callback() contention") Reported-by: Max Neunhoeffer <max@arangodb.com> Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Max Neunhoeffer <max@arangodb.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Jes Sorensen <jes.sorensen@gmail.com> Cc: <stable@vger.kernel.org> [5.1+] Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de References: https://bugzilla.kernel.org/show_bug.cgi?id=205933 Bisected-by: Max Neunhoeffer <max@arangodb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-03-21 18:56:06 -07:00
Linus Torvalds	1ab7ea1f83	io_uring-5.6-20200320 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl51dbQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiV3EADJHB2r2hTTEym5u1PbrEEVkjvdL6InU8lD lFM7m2g6yZUncwm+aSZynHqAFY6Rd5Jk+gmYMuioi3ZxC2rs7jG1AOTpaeJYmhle lzkjqSLtl+gdPMA9ydivk1UwILFjtZKG1JNc++tnCn3q7+eCkgnWAlq5b7idG2eF BS0AEZP6Yz1zStTHLbHSB0StY8ovMIw0VaVQvguHLL9EBpbHmrs0cq3tipWkAyPR 2YwnXbxsJySukkwmBKxEWrGUYDze56jqJIqdFsOE0+WtGV+nk7OScPseXAaP4/+G Vl23VNfryuZcsBUwI9tY1SzCFEXIwdXVGpCAYwQ/kU5WfvFpYaei+fXVNnL4kjR0 PfpA6XnMsZ3DzqgepmUd92sAA56ZtBxuGjqcSYlg/JwjvUHdpaZDkE2WLqkAMeUN 8A7cUw+R6XWQ2/y6ob7QvKiT/ZDR8GrYUl3EdGE3LhB1ZsvLXJDZpWipwQBzuk9R vJJOkGst38rjsWnb+nfeLh3AsgjF14wo+2vQL4mKs24xKTIvadHsFAZjKLXZ93Wf Vn58FaPOYIkjBidYLWb3dlO1ZR8S0803gohLkLV6adH8bCNCWxGTOR51DZLomAsb nAUCEAJaZrOqaQAuJAFNNpS8+/da3AIF4HVd2EdZ1yFXU15y0+zIxtROjKzg+OxO M3jC/Aet1Q== =IMcu -----END PGP SIGNATURE----- Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Two different fixes in here: - Fix for a potential NULL pointer deref for links with async or drain marked (Pavel) - Fix for not properly checking RLIMIT_NOFILE for async punted operations. This affects openat/openat2, which were added this cycle, and accept4. I did a full audit of other cases where we might check current->signal->rlim[] and found only RLIMIT_FSIZE for buffered writes and fallocate. That one is fixed and queued for 5.7 and marked stable" * tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block: io_uring: make sure accept honor rlimit nofile io_uring: make sure openat/openat2 honor rlimit nofile io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}	2020-03-21 11:54:47 -07:00
Filipe Manana	d8e6fd5c79	btrfs: fix removal of raid[56\|1c34} incompat flags after removing block group We are incorrectly dropping the raid56 and raid1c34 incompat flags when there are still raid56 and raid1c34 block groups, not when we do not have any of those anymore. The logic just got unintentionally broken after adding the support for the raid1c34 modes. Fix this by clearing the flags only if we do not have block groups with the respective profiles. Fixes: `9c907446dc` ("btrfs: drop incompat bit for raid1c34 after last block group is gone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-20 21:31:32 +01:00
Jens Axboe	4ed734b0d0	io_uring: honor original task RLIMIT_FSIZE With the previous fixes for number of files open checking, I added some debug code to see if we had other spots where we're checking rlimit() against the async io-wq workers. The only one I found was file size checking, which we should also honor. During write and fallocate prep, store the max file size and override that for the current task if we're in io-wq worker context. Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-20 11:41:23 -06:00
Jens Axboe	09952e3e78	io_uring: make sure accept honor rlimit nofile Just like commit `4022e7af86`, this fixes the fact that IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks current->signal->rlim[] for limits. Add an extra argument to __sys_accept4_file() that allows us to pass in the proper nofile limit, and grab it at request prep time. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-20 08:48:36 -06:00
Jens Axboe	4022e7af86	io_uring: make sure openat/openat2 honor rlimit nofile Dmitry reports that a test case shows that io_uring isn't honoring a modified rlimit nofile setting. get_unused_fd_flags() checks the task signal->rlim[] for the limits. As this isn't easily inheritable, provide a __get_unused_fd_flags() that takes the value instead. Then we can grab it when the request is prepared (from the original task), and pass that in when we do the async part of the open. Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Tested-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-20 08:47:27 -06:00
Eric Biggers	861261f2a9	ubifs: wire up FS_IOC_GET_ENCRYPTION_NONCE This new ioctl retrieves a file's encryption nonce, which is useful for testing. See the corresponding fs/crypto/ patch for more details. Link: https://lore.kernel.org/r/20200314205052.93294-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-03-19 21:57:06 -07:00
Eric Biggers	ee446e1af4	f2fs: wire up FS_IOC_GET_ENCRYPTION_NONCE This new ioctl retrieves a file's encryption nonce, which is useful for testing. See the corresponding fs/crypto/ patch for more details. Link: https://lore.kernel.org/r/20200314205052.93294-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-03-19 21:57:06 -07:00
Eric Biggers	7ec9f3b47a	ext4: wire up FS_IOC_GET_ENCRYPTION_NONCE This new ioctl retrieves a file's encryption nonce, which is useful for testing. See the corresponding fs/crypto/ patch for more details. Link: https://lore.kernel.org/r/20200314205052.93294-3-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-03-19 21:56:59 -07:00
Eric Biggers	e98ad46475	fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves the nonce from an encrypted file or directory. The nonce is the 16-byte random value stored in the inode's encryption xattr. It is normally used together with the master key to derive the inode's actual encryption key. The nonces are needed by automated tests that verify the correctness of the ciphertext on-disk. Except for the IV_INO_LBLK_64 case, there's no way to replicate a file's ciphertext without knowing that file's nonce. The nonces aren't secret, and the existing ciphertext verification tests in xfstests retrieve them from disk using debugfs or dump.f2fs. But in environments that lack these debugging tools, getting the nonces by manually parsing the filesystem structure would be very hard. To make this important type of testing much easier, let's just add an ioctl that retrieves the nonce. Link: https://lore.kernel.org/r/20200314205052.93294-2-ebiggers@kernel.org Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-03-19 21:56:54 -07:00
David S. Miller	3ac9eb4210	RxRPC fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl5zWDsACgkQ+7dXa6fL C2upyg/+KFOmCLFEAgwRnBn4zDDcdDT9du25Duv2d/XfAo2Zx+Nbwm7jjKR/mrRZ mRbcvb8qj92O4dzMCwcqDGpKT3xJmCZhxJQORBm55Bjme7tJDqXuQVYp1fZVy3Ka XJS0jr4n5HTorW8iGSIPJmE76XpIPq0ANhPnLbq8wZELyw87K7+J5ZdHcnUh+myd uKs8sIQ8PQZg6JBBj5wPRgrAkOFUTTINiUqy37ADIY1oZyzW1rUlAeAxVXV7Dnx7 G1HvlVaDw72G1XG4pn0pNBCdGJuNF0dG2zRbdjS+kGCmf6MB6x8e22JjWW9r+r9m iJd4B2R/3V/kUn4i3B+jfOWD5DKzCW4lDixh9D2LzM16GUinYQTkrH9e8jMBBJGW 7p7X9Vl3o0Nt6NDVLmTKuyomvvtT/jMYiDtKjPuvxlPCGduXB8HvNRFxsKIEVRHi 4RcdTqUSOsyUnOvTfDTfyBu1srKFqTC3HzAunntV88UfGtWdhXRCWMejHdNK3uI9 BC4Ym6jkmFnbQzytW/6noprvVlDfgAuyplcyhnnJ5fVNm4YQ7lZLZPgf5TS+gchI fMwDfRz3hOLDZ5WjCx6QLT1NHaowQLTrzTq0X3uj2ZrcnRORURvk8GfamzZoS9a5 omyQgBfm+1YpF2VwCyU42DytdmFDUCDofKondOXh8QciwhXqaRs= =bvsn -----END PGP SIGNATURE----- Merge tag 'rxrpc-fixes-20200319' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc, afs: Interruptibility fixes Here are a number of fixes for AF_RXRPC and AFS that make AFS system calls less interruptible and so less likely to leave the filesystem in an uncertain state. There's also a miscellaneous patch to make tracing consistent. (1) Firstly, abstract out the Tx space calculation in sendmsg. Much the same code is replicated in a number of places that subsequent patches are going to alter, including adding another copy. (2) Fix Tx interruptibility by allowing a kernel service, such as AFS, to request that a call be interruptible only when waiting for a call slot to become available (ie. the call has not taken place yet) or that a call be not interruptible at all (e.g. when we want to do writeback and don't want a signal interrupting a VM-induced writeback). 
(3) Increase the minimum delay on MSG_WAITALL for userspace sendmsg() when waiting for Tx buffer space as a 2*RTT delay is really small over 10G ethernet and a 1 jiffy timeout might be essentially 0 if at the end of the jiffy period. (4) Fix some tracing output in AFS to make it consistent with rxrpc. (5) Make sure aborted asynchronous AFS operations are tidied up properly so we don't end up with stuck rxrpc calls. (6) Make AFS client calls uninterruptible in the Rx phase. If we don't wait for the reply to be fully gathered, we can't update the local VFS state and we end up in an indeterminate state with respect to the server. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-19 20:28:34 -07:00
Linus Torvalds	cd607737f3	three small smb3 fixes, 2 for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5xH1sACgkQiiy9cAdy T1HzMgv/d27qMlDe1jrLgPY40FT6kjTfG6zKA8ikTg5LHt/esgqRrKsPTQVSVq/m f6ZVGNlcTDfwAq+90Rw38hreUKRYCkkVWoCEE9SUkCqlg/3MVMorA72p9eDnp0/u htADzvyBCNoMPJj1WGi5uyhGw58LBy5zWT4vibovGzEdlZ2Lv1qvVzyiGnju8ypy 2+0cgGhucQ8jfEAjqEP28T7nCT96+G0KJGqXX122+Mrx/agjGQ2xCCZRIH5ndVnp VmaN7WxGQmN9AdLtsVgkrRa9VYtndspMzo7xUArrferlF/yLijvO2Lcu7o3QtH8N RvLSc0qOD7eH3ETcAwvYd/luGH5OvvZDu4jHphK9KBz9GtGGRCKc7nxElv13S4LJ 27DG71x2XqTGmNoLmY57EZOtKVCsu6VBDlhq7u17RsYWDEurrvda0Nhe/Wo8P2yT dESnNEX5YGi+nWIjvxwRGMJ7Gb1ZXLdjkJC5QNzDID4AZVHE678AxDR+ZjkHCYLE Rsbsbmaw =x6+U -----END PGP SIGNATURE----- Merge tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Three small smb3 fixes, two for stable" * tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: CIFS: fiemap: do not return EINVAL if get nothing CIFS: Increment num_remote_opens stats counter even in case of smb2_query_dir_first cifs: potential unintitliazed error code in cifs_getattr()	2020-03-19 10:19:11 -07:00
Linus Torvalds	dcf23ac3e8	locks: reinstate locks_delete_block optimization There is measurable performance impact in some synthetic tests due to commit `6d390e4b5d` (locks: fix a potential use-after-free problem when wakeup a waiter). Fix the race condition instead by clearing the fl_blocker pointer after the wake_up, using explicit acquire/release semantics. This does mean that we can no longer use the clearing of fl_blocker as the wait condition, so switch the waiters over to checking whether the fl_blocked_member list_head is empty. Reviewed-by: yangerkun <yangerkun@huawei.com> Reviewed-by: NeilBrown <neilb@suse.de> Fixes: `6d390e4b5d` (locks: fix a potential use-after-free problem when wakeup a waiter) Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-03-18 13:03:38 -07:00
Christoph Hellwig	d981cb5b9f	block: fix a device invalidation regression Historically we only set the capacity to zero for devices that support partitions (independent of actually having partitions created). Doing that is rather inconsistent, but changing it broke legacy udisks polling for legacy ide-cdrom devices. Use a crude check for devices that either are non-removable or partitionable to get the sane behavior for most devices while not breaking userspace for this particular setup. Fixes: `a1548b6744` ("block: move rescan_partitions to fs/block_dev.c") Reported-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-18 08:47:04 -06:00
Greg Kroah-Hartman	526ee72dfd	debugfs: remove return value of debugfs_create_file_size() No one checks the return value of debugfs_create_file_size, as it's not needed, so make the return value void, so that no one tries to do so in the future. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200309163640.237984-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-18 13:35:29 +01:00
Taehee Yoo	275678e7a9	debugfs: Check module state before warning in {full/open}_proxy_open() When the module is being removed, the module state is set to MODULE_STATE_GOING. At this point, try_module_get() fails. And when {full/open}_proxy_open() is being called, it calls try_module_get() to try to hold module reference count. If it fails, it warns about the possibility of debugfs file leak. If {full/open}_proxy_open() is called while the module is being removed, it fails to hold the module. So, It warns about debugfs file leak. But it is not the debugfs file leak case. So, this patch just adds module state checking routine in the {full/open}_proxy_open(). Test commands: #SHELL1 while : do modprobe netdevsim echo 1 > /sys/bus/netdevsim/new_device modprobe -rv netdevsim done #SHELL2 while : do cat /sys/kernel/debug/netdevsim/netdevsim1/ports/0/ipsec done Splat looks like: [ 298.766738][T14664] debugfs file owner did not clean up at exit: ipsec [ 298.766766][T14664] WARNING: CPU: 2 PID: 14664 at fs/debugfs/file.c:312 full_proxy_open+0x10f/0x650 [ 298.768595][T14664] Modules linked in: netdevsim(-) openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 n][ 298.771343][T14664] CPU: 2 PID: 14664 Comm: cat Tainted: G W 5.5.0+ #1 [ 298.772373][T14664] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 298.773545][T14664] RIP: 0010:full_proxy_open+0x10f/0x650 [ 298.774247][T14664] Code: 48 c1 ea 03 80 3c 02 00 0f 85 c1 04 00 00 49 8b 3c 24 e8 e4 b5 78 ff 84 c0 75 2d 4c 89 ee 48 [ 298.776782][T14664] RSP: 0018:ffff88805b7df9b8 EFLAGS: 00010282[ 298.777583][T14664] RAX: dffffc0000000008 RBX: ffff8880511725c0 RCX: 0000000000000000 [ 298.778610][T14664] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8880540c5c14 [ 298.779637][T14664] RBP: 0000000000000000 R08: fffffbfff15235ad R09: 0000000000000000 [ 298.780664][T14664] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffc06b5000 [ 298.781702][T14664] R13: ffff88804c234a88 R14: 
ffff88804c22dd00 R15: ffffffff8a1b5660 [ 298.782722][T14664] FS: 00007fafa13a8540(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000 [ 298.783845][T14664] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 298.784672][T14664] CR2: 00007fafa0e9cd10 CR3: 000000004b286005 CR4: 00000000000606e0 [ 298.785739][T14664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 298.786769][T14664] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 298.787785][T14664] Call Trace: [ 298.788237][T14664] do_dentry_open+0x63c/0xf50 [ 298.788872][T14664] ? open_proxy_open+0x270/0x270 [ 298.789524][T14664] ? __x64_sys_fchdir+0x180/0x180 [ 298.790169][T14664] ? inode_permission+0x65/0x390 [ 298.790832][T14664] path_openat+0xc45/0x2680 [ 298.791425][T14664] ? save_stack+0x69/0x80 [ 298.791988][T14664] ? save_stack+0x19/0x80 [ 298.792544][T14664] ? path_mountpoint+0x2e0/0x2e0 [ 298.793233][T14664] ? check_chain_key+0x236/0x5d0 [ 298.793910][T14664] ? sched_clock_cpu+0x18/0x170 [ 298.794527][T14664] ? find_held_lock+0x39/0x1d0 [ 298.795153][T14664] do_filp_open+0x16a/0x260 [ ... ] Fixes: `9fd4dcece4` ("debugfs: prevent access to possibly dead file_operations at file open") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20200218043150.29447-1-ap420073@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-18 12:37:49 +01:00
Murphy Zhou	979a2665eb	CIFS: fiemap: do not return EINVAL if get nothing If we call fiemap on a truncated file with no blocks allocated, it makes sense that we get nothing from this call. No output means no blocks have been counted, but the call succeeded. It's a valid response. Simple example reproducer: xfs_io -f 'truncate 2M' -c 'fiemap -v' /cifssch/testfile xfs_io: ioctl(FS_IOC_FIEMAP) ["/cifssch/testfile"]: Invalid argument Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org>	2020-03-17 13:27:06 -05:00
Shyam Prasad N	1be1fa42eb	CIFS: Increment num_remote_opens stats counter even in case of smb2_query_dir_first The num_remote_opens counter keeps track of the number of open files which must be maintained by the server at any point. This is a per-tree-connect counter, and the value of this counter is displayed in the /proc/fs/cifs/Stats output as follows: Open files: 0 total (local), 1 open on server ^^^^^^^^^^^^^^^^ As a rule of thumb, we want to increment this counter for each open/create that we successfully execute on the server. Similarly, we should decrement the counter when we successfully execute a close. In this case, the increment was being missed in smb2_query_dir_first after a successful open. As a result, we would underflow the counter and we could even see the counter go negative after sufficient smb2_query_dir_first calls. I tested the stats counter for a bunch of filesystem operations with the fix, and the counter looks correct to me. I also checked whether we missed the increments and decrements elsewhere. It does not seem so. The few other cases where an open is done and we don't increment the counter are the compound calls where the corresponding close is also sent in the request. Signed-off-by: Shyam Prasad N <nspmangalore@gmail.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-03-17 13:27:03 -05:00
Dan Carpenter	39946886fc	cifs: potential unintitliazed error code in cifs_getattr() Smatch complains that "rc" could be uninitialized. fs/cifs/inode.c:2206 cifs_getattr() error: uninitialized symbol 'rc'. Changing it to "return 0;" improves readability as well. Fixes: cc1baf98c8f6 ("cifs: do not ignore the SYNC flags in getattr") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-03-17 13:26:26 -05:00
Linus Torvalds	34d5a4b336	Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism whihc causes a 100% performance regression. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj 32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+ hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd 7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN 0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq 6URL6DVTLwDZPg== =e2iB -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fix from Thomas Gleixner: "Fix for yet another subtle futex issue. The futex code used ihold() to prevent inodes from vanishing, but ihold() does not guarantee inode persistence. Replace the inode pointer with a per boot, machine wide, unique inode identifier. The second commit fixes the breakage of the hash mechanism which causes a 100% performance regression" * tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unbreak futex hashing futex: Fix inode life-time issue	2020-03-15 12:55:52 -07:00
Pavel Begunkov	60cf46ae60	io-wq: hash dependent work Enable io-wq hashing stuff for dependent works simply by re-enqueueing such requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-14 17:02:30 -06:00
Pavel Begunkov	8766dd516c	io-wq: split hashing and enqueueing It's a preparation patch removing io_wq_enqueue_hashed(), which now should be done by io_wq_hash_work() + io_wq_enqueue(). Also, set hash value for dependent works, and do it as late as possible, because req->file can be unavailable before. This hash will be ignored by io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-14 17:02:28 -06:00
Pavel Begunkov	d78298e73a	io-wq: don't resched if there is no work This little tweak restores the behaviour that was before the recent io_worker_handle_work() optimisation patches. It makes the function do cond_resched() and flush_signals() only if there is actual work to execute. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-14 17:02:26 -06:00
Pavel Begunkov	f1d96a8fcb	io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} Processing links, io_submit_sqe() prepares requests, drops sqes, and passes them with sqe=NULL to io_queue_sqe(). There, IOSQE_DRAIN and/or IOSQE_ASYNC requests will go through the same prep, which doesn't expect sqe=NULL and fails with a NULL pointer dereference. Always do full prepare including io_alloc_async_ctx() for linked requests, and then it can skip the second preparation. Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-14 16:57:41 -06:00
David S. Miller	44ef976ab3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your net-next tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows to attach to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the application owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 
9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-13 20:52:03 -07:00
David Howells	7d7587db0d	afs: Fix client call Rx-phase signal handling Fix the handling of signals in client rxrpc calls made by the afs filesystem. Ignore signals completely, leaving call abandonment or connection loss to be detected by timeouts inside AF_RXRPC. Allowing a filesystem call to be interrupted after the entire request has been transmitted and an abort sent means that the server may or may not have done the action - and we don't know. It may even be worse than that for older servers. Fixes: `bc5e3a546d` ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>	2020-03-13 23:04:35 +00:00
David Howells	dde9f09558	afs: Fix handling of an abort from a service handler When an AFS service handler function aborts a call, AF_RXRPC marks the call as complete - which means that it's not going to get any more packets from the receiver. This is a problem because reception of the final ACK is what triggers afs_deliver_to_call() to drop the final ref on the afs_call object. Instead, aborted AFS service calls may then just sit around waiting for ever or until they're displaced by a new call on the same connection channel or a connection-level abort. Fix this by calling afs_set_call_complete() to finalise the afs_call struct representing the call. However, we then need to drop the ref that stops the call from being deallocated. We can do this in afs_set_call_complete(), as the work queue is holding a separate ref of its own, but then we shouldn't do it in afs_process_async_call() and afs_delete_async_call(). call->drop_ref is set to indicate that a ref needs dropping for a call and this is dealt with when we transition a call to AFS_CALL_COMPLETE. But then we also need to get rid of the ref that pins an asynchronous client call. We can do this by the same mechanism, setting call->drop_ref for an async client call too. We can also get rid of call->incoming since nothing ever sets it and only one thing ever checks it (futilely). 
A trace of the rxrpc_call and afs_call struct ref counting looks like: <idle>-0 [001] ..s5 164.764892: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_new_incoming_call+0x473/0xb34 a=00000000442095b5 <idle>-0 [001] .Ns5 164.766001: rxrpc_call: c=00000002 QUE u=4 sp=rxrpc_propose_ACK+0xbe/0x551 a=00000000442095b5 <idle>-0 [001] .Ns4 164.766005: rxrpc_call: c=00000002 PUT u=3 sp=rxrpc_new_incoming_call+0xa3f/0xb34 a=00000000442095b5 <idle>-0 [001] .Ns7 164.766433: afs_call: c=00000002 WAKE u=2 o=11 sp=rxrpc_notify_socket+0x196/0x33c kworker/1:2-1810 [001] ...1 164.768409: rxrpc_call: c=00000002 SEE u=3 sp=rxrpc_process_call+0x25/0x7ae a=00000000442095b5 kworker/1:2-1810 [001] ...1 164.769439: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000001 00000000 02 22 ACK CallAck kworker/1:2-1810 [001] ...1 164.769459: rxrpc_call: c=00000002 PUT u=2 sp=rxrpc_process_call+0x74f/0x7ae a=00000000442095b5 kworker/1:2-1810 [001] ...1 164.770794: afs_call: c=00000002 QUEUE u=3 o=12 sp=afs_deliver_to_call+0x449/0x72c kworker/1:2-1810 [001] ...1 164.770829: afs_call: c=00000002 PUT u=2 o=12 sp=afs_process_async_call+0xdb/0x11e kworker/1:2-1810 [001] ...2 164.771084: rxrpc_abort: c=00000002 95786a88:00000008 s=0 a=1 e=1 K-1 kworker/1:2-1810 [001] ...1 164.771461: rxrpc_tx_packet: c=00000002 e9f1a7a8:95786a88:00000008:09c5 00000002 00000000 04 00 ABORT CallAbort kworker/1:2-1810 [001] ...1 164.771466: afs_call: c=00000002 PUT u=1 o=12 sp=SRXAFSCB_ProbeUuid+0xc1/0x106 The abort generated in SRXAFSCB_ProbeUuid(), labelled "K-1", indicates that the local filesystem/cache manager didn't recognise the UUID as its own. Fixes: `2067b2b3f4` ("afs: Fix the CB.ProbeUuid service handler to reply correctly") Signed-off-by: David Howells <dhowells@redhat.com>	2020-03-13 23:04:35 +00:00
David Howells	4636cf184d	afs: Fix some tracing details Fix a couple of tracelines to indicate the usage count after the atomic op, not the usage count before it to be consistent with other afs and rxrpc trace lines. Change the wording of the afs_call_trace_work trace ID label from "WORK" to "QUEUE" to reflect the fact that it's queueing work, not doing work. Fixes: `341f741f04` ("afs: Refcount the afs_call struct") Signed-off-by: David Howells <dhowells@redhat.com>	2020-03-13 23:04:34 +00:00
David Howells	e138aa7d32	rxrpc: Fix call interruptibility handling Fix the interruptibility of kernel-initiated client calls so that they're either only interruptible when they're waiting for a call slot to become available or they're not interruptible at all. Either way, they're not interruptible during transmission. This should help prevent StoreData calls from being interrupted when writeback is in progress. It doesn't, however, handle interruption during the receive phase. Userspace-initiated calls are still interruptible. After the signal has been handled, sendmsg() will return the amount of data copied out of the buffer and userspace can perform another sendmsg() call to continue transmission. Fixes: `bc5e3a546d` ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>	2020-03-13 23:04:30 +00:00
Linus Torvalds	b0ea262a23	NFS Client Bugfixes for Linux 5.6-rc5 Fixes: - Ensure the fs_context has the correct fs_type when mounting and submounting - Fix leaking of ctx->nfs_server.hostname - Add minor version to fscache key to prevent collisions -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5r+QkACgkQ18tUv7Cl QOtq4Q/+Oo707rb3N7DrPikUARB8D7FMTs/m/+xSPNm2DSllImIXJUdckqaoZkwc DwBMLw+ZDvHtcNytytJQOWJNp9LGjHpZ20g0TLr2p2/JRrQyGgpc0FxTJONwA5Pp zU6MSgqqfMZ5nLgxpMKsqoPNzO45sS8SKi2I6yZIupLZlsZOzF8L1wL/zc6gJvv8 71UGrSId9mEMKCrE8EQRx7etct5VPuP+pXfDGz4oaI1tdEmfmx3FoJlzZA1/Pf90 YSHdGZb7mR3LFkFRDlnh6NFHWU+yE+b5iWCt32ifO8pCN/CyIUvBxQblx4VLA47H 6S5nrYA96zRcQwhh9B/8yWLiqqxXo2hNl574uBJL/iDqSKSmkEBxZmCbE3aFEGa8 ClWlF6T5z4dlcAlKWXQkn3EXBHzL5+Opev5dArMhqNkr55g4z9Opsa6sc0ZWdywf h/rSM8bHn9SNYkCGFHQ1MjAn6eNU0vVQ/s9DhM2xdtyfyTQOOHx5yA/KF6aGG5oQ 3mlVEJCEfsBKyWWjHhq3e/7ezgLlKRRlauxdLgjmKy+PmtlY6mGii0eF601e9OSL RvK7I5/9spbcYmkyyQs0BxitrDZObyhxk31hgNrUMlN/JJrhPhKiyll8/Z7Z4sq6 QP6T/Vfn2FORA3ZAMtMH6V/ZOeiXUZjOkBRVWArIrPwOBJn/EQY= =3DDn -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client bugfixes from Anna Schumaker: "These are mostly fscontext fixes, but there is also one that fixes collisions seen in fscache: - Ensure the fs_context has the correct fs_type when mounting and submounting - Fix leaking of ctx->nfs_server.hostname - Add minor version to fscache key to prevent collisions" * tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs: add minor version to nfs_server_key for fscache NFS: Fix leak of ctx->nfs_server.hostname NFS: Don't hard-code the fs_type when submounting NFS: Ensure the fs_context has the correct fs_type before mounting	2020-03-13 15:21:32 -07:00
Linus Torvalds	7e6d869f5f	fuse fixes for 5.6-rc6 Merge tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fix from Miklos Szeredi: "Fix an Oops introduced in v5.4" * tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix stack use after return	2020-03-13 15:19:38 -07:00
Linus Torvalds	2af82177af	overlayfs fixes for 5.6-rc6 Merge tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix three bugs introduced in this cycle" * tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix lockdep warning for async write ovl: fix some xino configurations ovl: fix lock in ovl_llseek()	2020-03-13 15:17:21 -07:00
Filipe Manana	236ebc20d9	btrfs: fix log context list corruption after rename whiteout error During a rename whiteout, if btrfs_whiteout_for_rename() returns an error we can end up returning from btrfs_rename() with the log context object still in the root's log context list - this happens if 'sync_log' was set to true before we called btrfs_whiteout_for_rename() and it is dangerous because we end up with a corrupt linked list (root->log_ctxs) as the log context object was allocated on the stack. After btrfs_rename() returns, any task that is running btrfs_sync_log() concurrently can end up crashing because that linked list is traversed by btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in the same issue that commit `e6c617102c` ("Btrfs: fix log context list corruption after rename exchange operation") fixed. Fixes: `d4682ba03e` ("Btrfs: sync log after logging new name") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-03-13 22:15:09 +01:00
Linus Torvalds	5007928eae	io_uring-5.6-2020-03-13 Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "Just a single fix here, improving the RCU callback ordering from last week. After a bit more perusing by Paul, he poked a hole in the original" * tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block: io_uring: ensure RCU callback ordering with rcu_barrier()	2020-03-13 13:00:08 -07:00
Jann Horn	ddd2b85ff7	afs: Use kfree_rcu() instead of casting kfree() to rcu_callback_t afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong in theory, this might also blow up when people start enforcing function types via compiler instrumentation, and it means the rcu_head has to be first in struct afs_addr_list. Use kfree_rcu() instead, it's simpler and more correct. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-03-13 10:47:33 -07:00
Miklos Szeredi	c853680453	ovl: fix lockdep warning for async write Lockdep reports "WARNING: lock held when returning to user space!" due to async write holding freeze lock over the write. Apparently aio.c already deals with this by lying to lockdep about the state of the lock. Do the same here. No need to check for S_IFREG() here since these file ops are regular-only. Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com Fixes: `2406a307ac` ("ovl: implement async IO routines") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-13 15:53:06 +01:00
Amir Goldstein	53afcd310e	ovl: fix some xino configurations Fix up two bugs in the conversion to xino_mode: 1. xino=off does not always end up in disabled mode 2. xino=auto on 32bit arch should end up in disabled mode Take a proactive approach to disabling xino on 32bit kernel: 1. Disable XINO_AUTO config during build time 2. Disable xino with a warning at mount time As a by-product, xino=on on 32bit arch also ends up in disabled mode. We never intended to enable xino on 32bit arch and this will make the rest of the logic simpler. Fixes: `0f831ec85e` ("ovl: simplify ovl_same_sb() helper") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-13 15:53:06 +01:00
David S. Miller	1d34357931	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-12 22:34:48 -07:00
Carlos Neira	1e2328e762	fs/nsfs.c: Added ns_match ns_match returns true if the namespace inode and dev_t matches the ones provided by the caller. Signed-off-by: Carlos Neira <cneirabustos@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200304204157.58695-2-cneirabustos@gmail.com	2020-03-12 17:33:11 -07:00
Linus Torvalds	807f030b44	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes for old crap in ->atomic_open() instances" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cifs_atomic_open(): fix double-put on late allocation failure gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache	2020-03-12 15:51:26 -07:00
Al Viro	d9a9f4849f	cifs_atomic_open(): fix double-put on late allocation failure several iterations of ->atomic_open() calling conventions ago, we used to need fput() if ->atomic_open() failed at some point after successful finish_open(). Now (since 2016) it's not needed - struct file carries enough state to make fput() work regardless of the point in struct file lifecycle and discarding it on failure exits in open() got unified. Unfortunately, I'd missed the fact that we had an instance of ->atomic_open() (cifs one) that used to need that fput(), as well as the stale comment in finish_open() demanding such late failure handling. Trivially fixed... Fixes: `fe9ec8291f` "do_last(): take fput() on error after opening to out:" Cc: stable@kernel.org # v4.7+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-03-12 18:25:20 -04:00
Al Viro	2103913265	gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache with the way fs/namei.c:do_last() had been done, ->atomic_open() instances needed to recognize the case when existing file got found with O_EXCL|O_CREAT, either by falling back to finish_no_open() or failing themselves. gfs2 one didn't. Fixes: `6d4ade986f` (GFS2: Add atomic_open support) Cc: stable@kernel.org # v3.11 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-03-12 18:21:24 -04:00
Amir Goldstein	531d3040bc	ovl: fix lock in ovl_llseek() ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek() was replaced with ovl_inode_lock(), we did not add a check for error. Fix this by making ovl_inode_lock() uninterruptible and change the existing call sites to use an _interruptible variant. Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com Fixes: `b1f9d3858f` ("ovl: use ovl_inode_lock in ovl_llseek()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-03-12 16:38:10 +01:00
Pavel Begunkov	2293b41958	io-wq: remove duplicated cancel code Deduplicate cancellation parts, as many of them look the same, e.g. - io_wqe_cancel_cb_work() and io_wqe_cancel_work() - io_wq_worker_cancel() and io_work_cancel() Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:50:22 -06:00
Linus Torvalds	e6e6ec48dd	fscrypt fix for v5.6-rc6 Fix a bug where if userspace is writing to encrypted files while the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running, dirty inodes could be evicted, causing writes to be lost or the filesystem to hang due to a use-after-free. This was encountered during real-world use, not just theoretical. Tested with the existing fscrypt xfstests, and with a new xfstest I wrote to reproduce this bug. This fix does expose an existing bug with '-o lazytime' that Ted is working on fixing, but this fix is more critical and needed anyway regardless of the lazytime fix. Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt fix from Eric Biggers: "Fix a bug where if userspace is writing to encrypted files while the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running, dirty inodes could be evicted, causing writes to be lost or the filesystem to hang due to a use-after-free. This was encountered during real-world use, not just theoretical. Tested with the existing fscrypt xfstests, and with a new xfstest I wrote to reproduce this bug. This fix does expose an existing bug with '-o lazytime' that Ted is working on fixing, but this fix is more critical and needed anyway regardless of the lazytime fix" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: don't evict dirty inodes after removing key	2020-03-11 13:35:34 -07:00
Jens Axboe	3f9d64415f	io_uring: fix truncated async read/readv and write/writev retry Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size. Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-11 12:29:15 -06:00
Xiaoguang Wang	32b2244a84	io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to poll for I/O completion events again; they can rely on io_sq_thread to do the polling work, which can reduce CPU usage and uring_lock contention. I modified the fio io_uring engine code a bit to evaluate the performance: static int fio_ioring_getevents(struct thread_data *td, unsigned int min, continue; } - if (!o->sqpoll_thread) { + if (o->sqpoll_thread && o->hipri) { r = io_uring_enter(ld, 0, actual_min, IORING_ENTER_GETEVENTS); if (r < 0) { and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120" original code -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s fio cpu usage | 100% | 100% | 100% | 100% | 100% -------------------------------------------------------------------- with patch -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s fio cpu usage | 63.8% | 74.4% | 81.1% | 83.7% | 82.4% -------------------------------------------------------------------- bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5% -------------------------------------------------------------------- From the above test results, we can see that bw improves by about 5.5% to 13%, and the fio process's CPU usage also drops considerably. Note this won't improve io_sq_thread's CPU usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in this case, io_sq_thread always has 100% CPU usage. I think this patch will be friendly to applications which will often use io_uring_wait_cqe() or similar from liburing. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-11 07:14:12 -06:00
YueHaibing	469956e853	io_uring: Fix unused function warnings If CONFIG_NET is not set, gcc warns: fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function] static int io_setup_async_msg(struct io_kiocb *req, ^~~~~~~~~~~~~~~~~~ There are many functions wrapped by CONFIG_NET, move them together to simplify code, also fix this warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Minor tweaks. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:56 -06:00
Jens Axboe	84557871f2	io_uring: add end-of-bits marker and build time verify it Not easy to tell if we're going over the size of bits we can shove in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:56 -06:00
Jens Axboe	067524e914	io_uring: provide means of removing buffers We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 09:12:56 -06:00

63308 Commits