OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Liu Bo	f716abd55d	Btrfs: fix out of bounds array access while reading extent buffer There is a corner case that slips through the checkers in functions reading extent buffer, ie. if (start < eb->len) and (start + len > eb->len), then a) map_private_extent_buffer() returns immediately because it's thinking the range spans across two pages, b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len) and WARN_ON(start + len > eb->start + eb->len), both are OK in this corner case, but it'd actually try to access the eb->pages out of bounds because of (start + len > eb->len). The case is found by switching extent inline ref type from shared data ref to non-shared data ref, which is a kind of metadata corruption. It'd use the wrong helper to access the eb, eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing here is "struct btrfs_shared_data_ref". And if the extent item happens to be the first item in the eb, then offset/length will get over eb->len which ends up an invalid memory access. This is adding proper checks in order to avoid invalid memory access, ie. 'general protection fault', before it's too late. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-21 17:47:14 +02:00
Markus Elfring	8898662268	isofs: Delete an error message for a failed memory allocation in isofs_read_inode() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-21 16:05:48 +02:00
Markus Elfring	434aafb572	quota_v2: Delete an error message for a failed memory allocation in v2_read_file_info() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-21 16:02:59 +02:00
Trond Myklebust	7af7a5963c	Merge branch 'bugfixes'	2017-08-20 13:04:12 -04:00
Chuck Lever	53a75f22e7	NFS: Fix NFSv2 security settings For a while now any NFSv2 mount where sec= is specified uses AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX. Commit `e68fd7c807` ("mount: use sec= that was specified on the command line") attempted to address a very similar problem with NFSv3, and should have fixed this too, but it has a bug. The MNTv1 MNT procedure does not return a list of security flavors, so our client makes up a list containing just AUTH_NULL. This should enable nfs_verify_authflavors() to assign the sec= specified flavor, but instead, it incorrectly sets it to AUTH_NULL. I expect this would also be a problem for any NFSv3 server whose MNTv3 MNT procedure returned a security flavor list containing only AUTH_NULL. Fixes: `e68fd7c807` ("mount: use sec= that was specified on ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-20 12:43:34 -04:00
NeilBrown	b79e87e070	NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' An NFSv4.1 client might close a file after the user who opened it has logged off. In this case the user's credentials may no longer be valid, if they are e.g. kerberos credentials that have expired. NFSv4.1 has a mechanism to allow the client to use machine credentials to close a file. However due to a short-coming in the RFC, a CLOSE with those credentials may not be possible if the file in question isn't exported to the same security flavor - the required PUTFH must be rejected when this is the case. Specifically if a server and client support kerberos in general and have used it to form a machine credential, but the file is only exported to "sec=sys", a PUTFH with the machine credentials will fail, so CLOSE is not possible. As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there is no value in using the machine credential in place of them. So in that case, just use the users credentials for CLOSE etc, as you would in NFSv4.0 Signed-off-by: Neil Brown <neilb@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-20 12:43:17 -04:00
Linus Torvalds	7f680d7ec3	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Another pile of small fixes and updates for x86: - Plug a hole in the SMAP implementation which misses to clear AC on NMI entry - Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line parameter works correctly again - Use the proper accessor in the startup64 code for next_early_pgt to prevent accessing of invalid addresses and faulting in the early boot code. - Prevent CPU hotplug lock recursion in the MTRR code - Unbreak CPU0 hotplugging - Rename overly long CPUID bits which got introduced in this cycle - Two commits which mark data 'const' and restrict the scope of data and functions to file scope by making them 'static'" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Constify attribute_group structures x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt' x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks x86: Fix norandmaps/ADDR_NO_RANDOMIZE x86/mtrr: Prevent CPU hotplug lock recursion x86: Mark various structures and functions as 'static' x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag x86/smpboot: Unbreak CPU0 hotplug x86/asm/64: Clear AC on NMI entries	2017-08-20 09:36:52 -07:00
Trond Myklebust	3bde7afdab	NFS: Remove unused parameter gfp_flags from nfs_pageio_init() Now that the mirror allocation has been moved, the parameter can go. Also remove the redundant symbol export. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-20 11:35:33 -04:00
Trond Myklebust	14abcb0bf5	NFSv4: Fix up mirror allocation There are a number of callers of nfs_pageio_complete() that want to continue using the nfs_pageio_descriptor without needing to call nfs_pageio_init() again. Examples include nfs_pageio_resend() and nfs_pageio_cond_complete(). The problem is that nfs_pageio_complete() also calls nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors. This can lead to writeback errors, in the next call to nfs_pageio_setup_mirroring(). Fix by simply moving the allocation of the mirrors to nfs_pageio_setup_mirroring(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709 Reported-by: JianhongYin <yin-jianhong@163.com> Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-20 10:36:35 -04:00
Linus Torvalds	cc28fcdc01	Changes since last time: - Don't leak resources when mount fails - Don't accidentally clobber variables when looking for free inodes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCgAGBQJZlfFsAAoJEPh/dxk0SrTrTmQP/1Yga+FXQ1vjsyi0SyPRupwd 6beHGDEyLSmYaZKqye8v/nJlNVT8nmJofM20Hyu04f41K4oShQrzrI7jOOscOaYY jGEpgbx9fpLPD7AupgDvEDcrZyzZD/j3XxoSsOEGe5D6m3t2X0B4RtHz3jtj2s3e wkaBTE7GpzwrhC+9L+3AAtlpNlwkbjcCz0Wfrqlo8DjvRHTlutbYF51fthLJACtz U5XgNlxrjQlxGxn4IRHEqxmxWKz2iF4aQHGIX8OEGyt8J3YEO2t3K+nSalWduiBc mynExqVFIdGddNWoW4au6IKkPEahytsPVAiyt1TQMNvgkOMCO6DfUz+WmyQbd483 2r/xUbMdP78RQsUDXdrIEcTiHs/GEfQmIxUongf/0au3r2wmpQfbqzQuBxhuVbzW 1tQQsDKrO3r+GeEEoBPehtWVF/QPlQvlpT6pfft69kcgp5ukPDvOyOoM0ZEbKy72 zBWEs5O/kHUOBBXXdV2cqazplq3LyLuBMok1y+gUXXOyXfEd2w9LPqmoK3RmqSQ2 FnZc2A6tjko1NDLrSkq/uYRXIGi7ZAfxzqhP0L6XLUnu+kjN/A2Xb6pdfB9Wngl2 8nLVbBL/d28lMVPLJ5M3yxoVcQbIfcNqNA5QmWVCmPUqEwgMQFCsbBdYMKILI0ok B76xb0VyZBP5l9QJ514S =vJe/ -----END PGP SIGNATURE----- Merge tag 'xfs-4.13-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: "A handful more bug fixes for you today. Changes since last time: - Don't leak resources when mount fails - Don't accidentally clobber variables when looking for free inodes" * tag 'xfs-4.13-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't leak quotacheck dquots when cow recovery xfs: clear MS_ACTIVE after finishing log recovery iomap: fix integer truncation issues in the zeroing and dirtying helpers xfs: fix inobt inode allocation search optimization	2017-08-18 14:25:50 -07:00
Trond Myklebust	b7561e5186	Merge branch 'writeback'	2017-08-18 14:51:10 -04:00
Nikolay Borisov	c59efa7eb2	btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2 The buffer passed to btrfs_ioctl_tree_search* functions have to be at least sizeof(struct btrfs_ioctl_search_header). If this is not the case then the ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW error with ->buf_size being set to the value passed by user space. Fix this by removing the size check and relying on search_ioctl, which already includes it and correctly sets buf_size. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Nikolay Borisov	e6961cac73	btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio Currently the code checks whether we should do data checksumming in btrfs_submit_direct and the boolean result of this check is passed to btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which actually consumes it. The last function actually has all the necessary context to figure out whether to skip the check or not, so let's move the check closer to where it's being consumed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Filipe Manana	6399fb5a0b	Btrfs: fix assertion failure during fsync in no-holes mode When logging an inode in full mode that has an inline compressed extent that represents a range with a size matching the sector size (currently the same as the page size), has a trailing hole and the no-holes feature is enabled, we end up failing an assertion leading to a trace like the following: [141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, line: 4453 [141812.033069] ------------[ cut here ]------------ [141812.034330] kernel BUG at fs/btrfs/ctree.h:3452! [141812.035137] invalid opcode: 0000 [#1] PREEMPT SMP [141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs] [141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: G B W 4.12.3-btrfs-next-52+ #1 [141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [141812.036790] task: ffff8801e6694180 task.stack: ffffc90009004000 [141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs] [141812.036790] RSP: 0018:ffffc90009007bc0 EFLAGS: 00010282 [141812.036790] RAX: 0000000000000046 RBX: ffff88017512c008 RCX: 0000000000000001 [141812.036790] RDX: ffff88023fd95201 RSI: ffffffff8182264c RDI: 00000000ffffffff [141812.036790] RBP: ffffc90009007bc0 R08: 0000000000000001 R09: 0000000000000001 [141812.036790] R10: 0000000000001000 R11: ffffffff82f5a0c9 R12: ffff88014e5947e8 [141812.036790] R13: 00000000000b4000 R14: ffff8801b234d008 R15: 0000000000000000 [141812.036790] FS: 00007fdba6ffd700(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000 [141812.036790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [141812.036790] CR2: 00007fdb9c000010 CR3: 000000016efa2000 CR4: 00000000001406e0 [141812.036790] Call Trace: [141812.036790] btrfs_log_inode+0x9f0/0xd3d [btrfs] [141812.036790] ? __mutex_lock+0x120/0x3ce [141812.036790] btrfs_log_inode_parent+0x224/0x685 [btrfs] [141812.036790] ? lock_acquire+0x16b/0x1af [141812.036790] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [141812.036790] btrfs_sync_file+0x32e/0x3f8 [btrfs] [141812.036790] vfs_fsync_range+0x8a/0x9d [141812.036790] vfs_fsync+0x1c/0x1e [141812.036790] do_fsync+0x31/0x4a [141812.036790] SyS_fdatasync+0x13/0x17 [141812.036790] entry_SYSCALL_64_fastpath+0x18/0xad [141812.036790] RIP: 0033:0x7fdbac41a47d [141812.036790] RSP: 002b:00007fdba6ffce30 EFLAGS: 00000293 ORIG_RAX: 000000000000004b [141812.036790] RAX: ffffffffffffffda RBX: ffffffff81092c9f RCX: 00007fdbac41a47d [141812.036790] RDX: 0000004cf0160a40 RSI: 0000000000000000 RDI: 0000000000000006 [141812.036790] RBP: ffffc90009007f98 R08: 0000000000000000 R09: 0000000000000010 [141812.036790] R10: 00000000000002e8 R11: 0000000000000293 R12: ffffffff8110cd90 [141812.036790] R13: ffffc90009007f78 R14: 0000000000000000 R15: 0000000000000000 [141812.036790] ? time_hardirqs_off+0x9/0x14 [141812.036790] ? trace_hardirqs_off_caller+0x1f/0xa3 [141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 [141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: ffffc90009007bc0 [141812.084448] ---[ end trace 44e472684c7a32cc ]--- Which happens because the code that logs a trailing hole when the no-holes feature is enabled, did not consider that a compressed inline extent can represent a range with a size matching the sector size, in which case expanding the inode's i_size, through a truncate operation, won't lead to padding with zeroes the page that represents the inline extent, and therefore the inline extent remains after the truncation. Fix this by adapting the assertion to accept inline extents representing data with a sector size length if, and only if, the inline extents are compressed. A sample and trivial reproducer (for systems with a 4K page size) for this issue: mkfs.btrfs -O no-holes -f /dev/sdc mount -o compress /dev/sdc /mnt xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar sync xfs_io -c "truncate 32K" /mnt/foobar xfs_io -c "fsync" /mnt/foobar Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Filipe Manana	4a4b964f42	Btrfs: avoid unnecessarily locking inode when clearing a range If the range being cleared was not marked for defrag and we are not about to clear the range from the defrag status, we don't need to lock and unlock the inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Colin Ian King	938e1c77f8	btrfs: remove redundant check on ret being non-zero The error return variable ret is initialized to zero and then is checked to see if it is non-zero in the if-block that follows it. It is therefore impossible for ret to be non-zero after the if-block hence the check is redundant and can be removed. Detected by CoverityScan, CID#1021040 ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Nikolay Borisov	2d77ab3cfb	btrfs: expose internal free space tree routine only if sanity tests are enabled The internal free space tree management routines are always exposed for testing purposes. Make them dependent on SANITY_TESTS being on so that they are exposed only when they really have to. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Nikolay Borisov	db7c942ce8	btrfs: Remove unused sectorsize variable from struct map_lookup This variable was added in `1abe9b8a13` ("Btrfs: add initial tracepointi support for btrfs"), yet it never really got used, only assigned to. So let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Nikolay Borisov	92ac58ec99	btrfs: Remove never-reached WARN_ON We have a WARN_ON(!var) inside an if branch which is executed (among others) only when var is true. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Anand Jain	dc2f29212a	btrfs: remove unused BTRFS_COMPRESS_LAST We aren't using this define, so removing it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Anand Jain	44880fdc91	btrfs: use appropriate define for the fsid Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we should use the matching constant for the fsid buffer. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:29 +02:00
Josef Bacik	42e9cc46fb	btrfs: increase ctx->pos for delayed dir index Our dir_context->pos is supposed to hold the next position we're supposed to look. If we successfully insert a delayed dir index we could end up with a duplicate entry because we don't increase ctx->pos after doing the dir_emit. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-18 16:36:20 +02:00
Kees Cook	c71b02e4d2	Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps" This reverts commit `68c4a4f8ab`, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>	2017-08-17 16:29:19 -07:00
Kees Cook	d7caa33687	pstore: Make default pstorefs root dir perms 0750 Currently only DMESG and CONSOLE record types are protected, and it isn't obvious that they are using a capability check. Instead switch to explicit root directory mode of 0750 to keep files private by default. This will allow the removal of the capability check, which was non-obvious and forces a process to have possibly too much privilege when simple post-boot chgrp for readers would be possible without it. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>	2017-08-17 16:28:37 -07:00
Jan Kara	7b9ca4c61b	quota: Reduce contention on dq_data_lock dq_data_lock is currently used to protect all modifications of quota accounting information, consistency of quota accounting on the inode, and dquot pointers from inode. As a result contention on the lock can be pretty heavy. Reduce the contention on the lock by protecting quota accounting information by a new dquot->dq_dqb_lock and consistency of quota accounting with inode usage by inode->i_lock. This change reduces time to create 500000 files on ext4 on ramdisk by 50 different processes in separate directories by 6% when user quota is turned on. When those 50 processes belong to 50 different users, the improvement is about 9%. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:07:59 +02:00
Jan Kara	f4a8116a4c	fs: Provide __inode_get_bytes() Provide helper __inode_get_bytes() which assumes i_lock is already acquired. Quota code will need this to be able to use i_lock to protect consistency of quota accounting information and inode usage. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:06:03 +02:00
Jan Kara	3ab167d2ba	quota: Inline dquot_[re]claim_reserved_space() into callsite dquot_claim_reserved_space() and dquot_reclaim_reserved_space() have only a single callsite. Inline them there. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:03:03 +02:00
Jan Kara	a478e522e3	quota: Inline inode_{incr,decr}_space() into callsites inode_incr_space() and inode_decr_space() have only two callsites. Inline them there as that will make locking changes simpler. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:01:14 +02:00
Jan Kara	0ed60de34a	quota: Inline functions into their callsites inode_add_rsv_space() and inode_sub_rsv_space() had only one callsite. Inline them there directly. inode_claim_rsv_space() and inode_reclaim_rsv_space() had two callsites so inline them there as well. This will simplify further locking changes. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:59 +02:00
Jan Kara	91389240a2	ext4: Disable dirty list tracking of dquots when journalling quotas When journalling quotas, we writeback all dquots immediately after changing them as part of current transation. Thus there's no need to write anything in dquot_writeback_dquots() and so we can avoid updating list of dirty dquots to reduce dq_list_lock contention. This change reduces time to create 500000 files on ext4 on ramdisk by 50 different processes in separate directories by 15% when user quota is turned on. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:54 +02:00
Jan Kara	834057bf84	quota: Allow disabling tracking of dirty dquots in a list Filesystems that are journalling quotas generally don't need tracking of dirty dquots in a list since forcing a transaction commit flushes all quotas anyway. Allow filesystem to say it doesn't want dquots to be tracked as it reduces contention on the dq_list_lock. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:45 +02:00
Jan Kara	503330f382	quota: Remove dq_wait_unused from dquot Currently every dquot carries a wait_queue_head_t used only when we are turning quotas off to wait for last users to drop dquot references. Since such rare case is not performance sensitive in any means, just use a global waitqueue for this and save space in struct dquot. Also convert the logic to use wait_event() instead of open-coding it. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:40 +02:00
Jan Kara	1e0b7cb062	quota: Move locking into clear_dquot_dirty() Move locking of dq_list_lock into clear_dquot_dirty(). It makes the function more self-contained and will simplify our life later. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:24 +02:00
Jan Kara	4580b30ea8	quota: Do not dirty bad dquots Currently we mark dirty even dquots that are not active (i.e., initialization or reading failed for them). Thus later we have to check whether dirty dquot is really active and just clear the dirty bit if not. Avoid this complication by just never marking non-active dquot as dirty. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:07 +02:00
Jan Kara	15512377bd	quota: Fix possible corruption of dqi_flags dqi_flags modifications are protected by dq_data_lock. However the modifications in vfs_load_quota_inode() and in mark_info_dirty() were not which could lead to corruption of dqi_flags. Since modifications to dqi_flags are rare, this is hard to observe in practice but in theory it could happen. Fix the problem by always using dq_data_lock for protection. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 22:00:04 +02:00
Darrick J. Wong	77aff8c764	xfs: don't leak quotacheck dquots when cow recovery If we fail a mount on account of cow recovery errors, it's possible that a previous quotacheck left some dquots in memory. The bailout clause of xfs_mountfs forgets to purge these, and so we leak them. Fix that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-08-17 12:40:33 -07:00
Darrick J. Wong	8204f8ddaa	xfs: clear MS_ACTIVE after finishing log recovery Way back when we established inode block-map redo log items, it was discovered that we needed to prevent the VFS from evicting inodes during log recovery because any given inode might be have bmap redo items to replay even if the inode has no link count and is ultimately deleted, and any eviction of an unlinked inode causes the inode to be truncated and freed too early. To make this possible, we set MS_ACTIVE so that inodes would not be torn down immediately upon release. Unfortunately, this also results in the quota inodes not being released at all if a later part of the mount process should fail, because we never reclaim the inodes. So, set MS_ACTIVE right before we do the last part of log recovery and clear it immediately after we finish the log recovery so that everything will be torn down properly if we abort the mount. Fixes: `17c12bcd30` ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-08-17 12:40:33 -07:00
Jan Kara	f98bbe37ae	quota: Propagate ->quota_read errors from v2_read_file_info() Currently we return -EIO on any error (or short read) from ->quota_read() while reading quota info. Propagate the error code instead. Suggested-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:20:28 +02:00
Jan Kara	cb8d01b4f6	quota: Fix error codes in v2_read_file_info() v2_read_file_info() returned -1 instead of proper error codes on error. Luckily this is not easily visible from userspace as we have called ->check_quota_file shortly before and thus already verified the quota file is sane. Still set the error codes to proper values. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:18:11 +02:00
Jan Kara	42fdb8583d	quota: Push dqio_sem down to ->read_file_info() Push down acquisition of dqio_sem into ->read_file_info() callback. This is for consistency with other operations and it also allows us to get rid of an ugliness in OCFS2. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:16:24 +02:00
Jan Kara	9a8ae30e73	quota: Push dqio_sem down to ->write_file_info() Push down acquisition of dqio_sem into ->write_file_info() callback. Mostly for consistency with other operations. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:11:23 +02:00
Jan Kara	f14618c682	quota: Push dqio_sem down to ->get_next_id() Push down acquisition of dqio_sem into ->get_next_id() callback. Mostly for consistency with other operations. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:05:16 +02:00
Jan Kara	b9a1a7f4b6	quota: Push dqio_sem down to ->release_dqblk() Push down acquisition of dqio_sem into ->release_dqblk() callback. It will allow quota formats to decide whether they need it or not. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:04:01 +02:00
Jan Kara	f0c5bae5cc	quota: Remove locking for writing to the old quota format The old quota quota format has fixed offset in quota file based on ID so there's no locking needed against concurrent modifications of the file (locking against concurrent IO on the same dquot is still provided by dq_lock). Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:03:44 +02:00
Jan Kara	d2faa41516	quota: Do not acquire dqio_sem for dquot overwrites in v2 format When dquot has space already allocated in a quota file, we just overwrite that place when writing dquot. So we don't need any protection against other modifications of quota file as these keep dquot in place. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:03:16 +02:00
Jan Kara	8fc32c2b0d	quota: Push dqio_sem down to ->write_dqblk() Push down acquisition of dqio_sem into ->write_dqblk() callback. It will allow quota formats to decide whether they need it or not. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:03:09 +02:00
Jan Kara	47cdc11dee	quota: Remove locking for reading from the old quota format The old quota format has fixed offset in quota file based on ID so there's no locking needed against concurrent modifications of the file (locking against concurrent IO on the same dquot is still provided by dq_lock). Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:01:16 +02:00
Jan Kara	e342e38df9	quota: Push dqio_sem down to ->read_dqblk() Push down acquisition of dqio_sem into ->read_dqblk() callback. It will allow quota formats to decide whether they need it or not. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:01:00 +02:00
Jan Kara	5e8cb9b624	quota: Protect dquot writeout with dq_lock Currently dquot writeout is only protected by dqio_sem held for writing. As we transition to a finer grained locking we will use dquot->dq_lock instead. So acquire it in dquot_commit() and move dqio_sem just around ->commit_dqblk() call as it is still needed to serialize quota file changes. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 19:00:50 +02:00
Jan Kara	d6ab366102	quota: Acquire dqio_sem for reading in vfs_load_quota_inode() vfs_load_quota_inode() needs dqio_sem only for reading. In fact dqio_sem is not needed there at all since the function can be called only during quota on when quota file cannot be modified but let's leave the protection there since it is logical and the path is in no way performance critical. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 18:59:04 +02:00
Jan Kara	0cff9151d3	quota: Acquire dqio_sem for reading in dquot_get_next_id() dquot_get_next_id() needs dqio_sem only for reading to protect against racing with modification of quota file structure. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 18:58:44 +02:00
Jan Kara	62676838cb	quota: Do more fine-grained locking in dquot_acquire() We need dqio_sem held just for reading when calling ->read_dqblk() in dquot_acquire(). Also dqio_sem is not needed when setting DQ_READ_B and DQ_ACTIVE_B as concurrent reads and dquot activations are serialized by dq_lock. So acquire and release dqio_sem closer to the place where it is needed. This reduces lock hold time and will make locking changes easier. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 18:56:07 +02:00
Jan Kara	bc8230ee8e	quota: Convert dqio_mutex to rwsem Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes yet. Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-17 18:52:48 +02:00
Linus Torvalds	99f781b1bf	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota fix from Jan Kara: "A fix of a check for quota limit" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: correct space limit check	2017-08-17 09:26:10 -07:00
Linus Torvalds	c8c03f1858	pty: fix the cached path of the pty slave file descriptor in the master Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to get a slave pty file descriptor, the resulting file descriptor doesn't look right in /proc/<pid>/fd/<fd>. In particular, he wanted to use readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty (basically implementing "ptsname{_r}()"). The reason for that was that we had generated the wrong 'struct path' when we create the pty in ptmx_open(). In particular, the dentry was correct, but the vfsmount pointed to the mount of the ptmx node. That _can_ be correct - in case you use "/dev/pts/ptmx" to open the master - but usually is not. The normal case is to use /dev/ptmx, which then looks up the pts/ directory, and then the vfsmount of the ptmx node is obviously the /dev directory, not the /dev/pts/ directory. We actually did have the right vfsmount available, but in the wrong place (it gets looked up in 'devpts_acquire()' when we get a reference to the pts filesystem), and so ptmx_open() used the wrong mnt pointer. The end result of this confusion was that the pty worked fine, but when if you did TIOCGPTPEER to get the slave side of the pty, end end result would also work, but have that dodgy 'struct path'. And then when doing "d_path()" on to get the pathname, the vfsmount would not match the root of the pts directory, and d_path() would return an empty pathname thinking that the entry had escaped a bind mount into another mount. This fixes the problem by making devpts_acquire() return the vfsmount for the pts filesystem, allowing ptmx_open() to trivially just use the right mount for the pts dentry, and create the proper 'struct path'. Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-17 09:10:48 -07:00
Oleg Nesterov	01578e3616	x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and randomize_stack_top() are not required. PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not set, no need to re-check after that. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20170815154011.GB1076@redhat.com	2017-08-16 20:32:02 +02:00
Markus Elfring	b5f5245491	fs-udf: Delete an error message for a failed memory allocation in two functions Omit an extra message for a memory allocation failure in these functions. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-16 16:43:23 +02:00
Markus Elfring	033c9da008	fs-udf: Improve six size determinations Replace the specification of data structures by variable references as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-16 16:42:03 +02:00
Markus Elfring	ba2eb866a8	fs-udf: Adjust two checks for null pointers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The script “checkpatch.pl” pointed information out like the following. Comparison to NULL could be written !… Thus fix affected source code places. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-16 16:38:54 +02:00
Josef Bacik	23b5ec7494	btrfs: fix readdir deadlock with pagefault Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
Nikolay Borisov	8d8aafeea2	btrfs: Simplify math in should_alloc chunk Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to signify the total amount of bytes in this space info. However, given Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation of num_allocated it becomes a lot more clear to just eliminate num_bytes altogether and add the bytes_readonly to the amount of used space. That way we don't change the results of the following statements. In the process also start using btrfs_space_info_used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
Jeff Mahoney	f44d2287d2	btrfs: account for pinned bytes in should_alloc_chunk In a heavy write scenario, we can end up with a large number of pinned bytes. This can translate into (very) premature ENOSPC because pinned bytes must be accounted for when allowing a reservation but aren't accounted for when deciding whether to create a new chunk. This patch adds the accounting to should_alloc_chunk so that we can create the chunk. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
David Sterba	a7164fa4e0	btrfs: prepare for extensions in compression options This is a minimal patch intended to be backported to older kernels. We're going to extend the string specifying the compression method and this would fail on kernels before that change (the string is compared exactly). Relax the string matching only to the prefix, ie. ignoring anything that goes after "zlib" or "lzo", regardless of th format extension we decide to use. This applies to the mount options and properties. That way, patched old kernels could be booted on systems already utilizing the new compression spec. Applicable since commit `63541927c8`, v3.14. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
David Sterba	1e20d1c45f	btrfs: allow defrag compress to override NOCOMPRESS attribute Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a given file, except when the mount is force-compress. As users have reported on IRC, this will also prevent compression when requested by defrag (btrfs fi defrag -c file). The nocompress flag is set automatically by filesystem when the ratios are bad and the user would have to manually drop the bit in order to make defrag -c work. This is not good from the usability perspective. This patch will raise priority for the defrag -c over nocompress, ie. any file with NOCOMPRESS bit set will get defragmented. The bit will remain untouched. Alternate option was to also drop the nocompress bit and keep the decision logic as is, but I think this is not the right solution. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
David Sterba	1e2ef46d89	btrfs: defrag: cleanup checking for compression status Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
David Sterba	eec63c65dc	btrfs: separate defrag and property compression Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called. The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug. Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem). Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
David Sterba	b52aa8c93e	btrfs: rename variable holding per-inode compression type This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line). We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:05 +02:00
Timofey Titovets	c2fcdcdf36	Btrfs: add skeleton code for compression heuristic Add skeleton code for compresison heuristics. Now it iterates over all the pages, but in the end always says "yes, compress please", ie it does not change the current behaviour. In the future we're going to add various heuristics to analyze the data. This patch can be used as a baseline for measuring if the effectivness and performance. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhanced changelog, modified comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	131ce4367a	btrfs: account that we're waiting for IO in scrub_submit_raid56_bio_wait Correctly account for IO when waiting for a submitted bio in scrub. This only for the accounting purposes and should not change other behaviour. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	9c17f6cda1	btrfs: account that we're waiting for DIO read Correctly account for IO when waiting for a submitted DIO read, the case when we're retrying. This only for the accounting purposes and should not change other behaviour. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	4958aa6821	btrfs: drop chunk locks at the end of close_ctree The pinned chunks might be left over so we clean them but at this point of close_ctree, there's noone to race with, the locking can be removed. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	d3c0bab563	btrfs: remove trivial wrapper btrfs_force_ra It's a simple call page_cache_sync_readahead, same arguments in the same order. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	35dc313046	btrfs: drop ancient page flag mappings There's no PageFsMisc. Added by patch `4881ee5a2e` in 2008, the flag is not present in current kernels. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	ea14b57fd1	btrfs: fix spelling of snapshotting Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
Nikolay Borisov	e38ae7a086	btrfs: Make flush_space return void The return value of flush_space was used to have significance in the early days when the code was first introduced and before the ticketed enospc rework. Since the latter got introduced the return value lost any significance whatsoever to its callers. So let's remove it. While at it also remove the unused ticket variable in btrfs_async_reclaim_metadata_space. It was used in the initial version of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem with this and fixed it in `ce129655c9` ("btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
Nikolay Borisov	3558d4f88e	btrfs: Deprecate userspace transaction ioctls Userspace transactions were introduced in commit `6bf13c0cc8` ("Btrfs: transaction ioctls") to provide semantics that Ceph's object store required. However, things have changed significantly since then, to the point where btrfs is no longer suitable as a backend for ceph and in fact it's actively advised against such usages. Considering this, there doesn't seem to be a widespread, legit use case of userspace transaction. They also clutter the file->private pointer. So to end the agony let's nuke the userspace transaction ioctls. As a first step let's give time for people to voice their objection by just WARN()ining when the userspace transaction is used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ move the warning past perm checks, keep the has-been-printed state; we're ok with just one warning over all filesystems ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	9f6d251033	btrfs: use named constant for bdev blocksize Superblock is read and written using buffer heads, we need to set the bdev blocksize. The magic constant has been hardcoded in several places, so replace it with a named constant. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	abbb3b8ebf	btrfs: split write_dev_supers to two functions There are two independent parts, one that writes the superblocks and another that waits for completion. No functional changes, but cleanups, reformatting and comment updates. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	35c70103a5	btrfs: refactor find_device helper Polish the helper: * drop underscores, no special meaning here * pass fs_devices, as this is what the API implements * drop noinline, no apparent reason for such simple helper * constify uuid * add comment Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	2dfeca9bfb	btrfs: merge alloc_device helpers There are two helpers called in chain from one location, we can merge the functionaliy. Originally, alloc_fs_devices could fill the device uuid randomly if we we didn't give the uuid buffer. This happens for seed devices but the fsid is generated in btrfs_prepare_sprout, so we can remove it. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	4b81ba48c6	btrfs: merge REQ_OP and REQ_ flags to one parameter in submit_extent_page The function submit_extent_page has 15(!) parameters right now, op and op_flags are effectively one value stored to bio::bi_opf, no need to pass them separately. So it's 14 parameters now. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	f1c77c55cd	btrfs: cleanup types storing REQ_* Unify types of local variables and parameters that store various REQ_* values to unsigned int. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	abe60ba45c	btrfs: get fs_info from eb in btrfs_print_tree, remove argument Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	a4f78750ef	btrfs: get fs_info from eb in btrfs_print_leaf, remove argument Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	f1b8a1e8c0	btrfs: simplify btrfs_dev_replace_kthread This function prints an informative message and then continues dev-replace. The message contains a progress percentage which is read from the status. The status is allocated dynamically, about 2600 bytes, just to read the single value. That's an overkill. We'll use the new helper and drop the allocation. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	74b595fe67	btrfs: factor reading progress out of btrfs_dev_replace_status We'll want to read the percentage value from dev_replace elsewhere, move the logic to a separate helper. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	0a52d10808	btrfs: defrag: make readahead state allocation failure non-fatal All sorts of readahead errors are not considered fatal. We can continue defragmentation without it, with some potential slow down, which will last only for the current inode. Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	63e727ecd2	btrfs: use GFP_KERNEL in btrfs_defrag_file We can safely use GFP_KERNEL, the function is called from two contexts: - ioctl handler, called directly, no locks taken - cleaner thread, running all queued defrag work, outside of any locks Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	3ec8362111	btrfs: use GFP_KERNEL in mount and remount We don't need to restrict the allocation flags in btrfs_mount or _remount. No big filesystem locks are held (possibly s_umount but that does no count here). Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	e3f3ad1268	btrfs: Remove never reached error handling code in __add_reloc_root One of the error handling paths in __add_reloc_root contains btrfs_panic() followed by some other code. As the name implies what it does is print some error message and call BUG, naturally what follow afterwards is not invoked. So remove this extra code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	e4ff5fb5dc	btrfs: Remove unused parameters from volume.c functions This also adjusts the respective callers in other files. Those were found with -Wunused-parameter. btrfs_full_stripe_len's mapping_tree - introduced by `53b381b3ab` ("Btrfs: RAID5 and RAID6") but it was never really used even in that commit btrfs_is_parity_mirror's mirror_num - same as above chunk_drange_filter's chunk_offset - introduced by `94e60d5a5c` ("Btrfs: devid subset filter") and never used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	110840bb62	btrfs: Remove unused variables clear_super - usage was removed in commit `cea67ab92d` ("btrfs: clean the old superblocks before freeing the device") but that change forgot to remove the actual variable. max_key - commit `6174d3cb43` ("Btrfs: remove unused max_key arg from btrfs_search_forward") removed the max_key parameter but it forgot to remove references from callers. stripe_len - this one was added by `e06cd3dd7c` ("Btrfs: add validadtion checks for chunk loading") but even then it wasn't used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	500ceed807	btrfs: Remove find_raid56_stripe_len find_raid56_stripe_len statically returns SZ_64K which equals BTRFS_STRIPE_LEN. It's sole caller is __btrfs_alloc_chunk and it assigns the return value to ai variable which is already set to BTRFS_STRIPE_LEN. So remove the function invocation altogether and remove the function itself. Also remove the variable since it's only aliasing BTRFS_STRIPE_LEN and use the define directly. Use the occassion to simplify the rounding down of stripe_size now that the value we want it to align is a power of 2. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	47f08b9699	btrfs: Use explicit round_down macro in btrfs resize ioctl handler No functional changes, just make the code more self-explanatory. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Anand Jain	19aee8dea3	btrfs: btrfs_inherit_iflags() can be static btrfs_new_inode() is the only consumer move it to inode.c, from ioctl.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nick Terrell	26b28dce50	btrfs: Keep one more workspace around find_workspace() allocates up to num_online_cpus() + 1 workspaces. free_workspace() will only keep num_online_cpus() workspaces. When (de)compressing we will allocate num_online_cpus() + 1 workspaces, then free one, and repeat. Instead, we can just keep num_online_cpus() + 1 workspaces around, and never have to allocate/free another workspace in the common case. I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever a workspace was allocated of freed. Then I copied vmlinux (527 MB) to the partition. Before the patch, during the copy it would allocate and free 5-6 workspaces. After, it only allocated the initial 3. This held true for lzo, zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s for zstd compression. Signed-off-by: Nick Terrell <terrelln@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
David Sterba	913e153572	btrfs: drop newlines from strings when using btrfs_* helpers The helpers append "\n" so we can keep the actual strings shorter. The extra newline will print an empty line. Some messages have been slightly modified to be more consistent with the rest (lowercase first letter). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	b6e6bca51e	btrfs: qgroups: Fix BUG_ON condition in tree level check The current code was erroneously checking for root_level > BTRFS_MAX_LEVEL. If we had a root_level of 8 then the check won't trigger and we could potentially hit a buffer overflow. The correct check should be root_level >= BTRFS_MAX_LEVEL . Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	c550245148	btrfs: Enhance message when a device is missing during mount For a missing device, btrfs will just refuse to mount with almost meaningless kernel message like: BTRFS info (device vdb6): disk space caching is enabled BTRFS info (device vdb6): has skinny extents BTRFS error (device vdb6): failed to read the system array: -5 BTRFS error (device vdb6): open_ctree failed This patch will print a new message about the missing device: BTRFS info (device vdb6): disk space caching is enabled BTRFS info (device vdb6): has skinny extents BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing BTRFS error (device vdb6): failed to read the system array: -5 BTRFS error (device vdb6): open_ctree failed Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	bc3cce2378	btrfs: Cleanup num_tolerated_disk_barrier_failures As we use per-chunk degradable check, the global num_tolerated_disk_barrier_failures is of no use. We can now remove it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	d10b82fe29	btrfs: Allow barrier_all_devices to do chunk level device check The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it can be easily changed to the new per-chunk degradable check framework. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	b382cfe889	btrfs: Do chunk level check for degraded remount Just the same for mount time check, use btrfs_check_rw_degradable() to check if we are OK to be remounted rw. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	4330e183c9	btrfs: Do chunk level check for degraded rw mount Now use the btrfs_check_rw_degradable() to check if we can mount in the degraded mode. With this patch, we can mount in the following case: # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc # wipefs -a /dev/sdc # mount /dev/sdb /mnt/btrfs -o degraded As the single data chunk is only on sdb, so it's OK to mount as degraded, as missing one device is OK for RAID1. But still fail in the following case as expected: # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc # wipefs -a /dev/sdb # mount /dev/sdc /mnt/btrfs -o degraded As the data chunk is only in sdb, so it's not OK to mount it as degraded. Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reported-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Qu Wenruo	21634a19f6	btrfs: Introduce a function to check if all chunks a OK for degraded rw mount Introduce a new function, btrfs_check_rw_degradable(), to check if all chunks in btrfs is OK for degraded rw mount. It provides the new basis for accurate btrfs mount/remount and even runtime degraded mount check other than old one-size-fit-all method. Btrfs currently uses num_tolerated_disk_barrier_failures to do global check for tolerated missing device. Although the one-size-fit-all solution is quite safe, it's too strict if data and metadata has different duplication level. For example, if one use Single data and RAID1 metadata for 2 disks, it means any missing device will make the fs unable to be degraded mounted. But in fact, some times all single chunks may be in the existing device and in that case, we should allow it to be rw degraded mounted. Such case can be easily reproduced using the following script: # mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc # wipefs -f /dev/sdc # mount /dev/sdb -o degraded,rw If using btrfs-debug-tree to check /dev/sdb, one should find that the data chunk is only in sdb, so in fact it should allow degraded mount. This patchset will introduce a new per-chunk degradable check for btrfs, allow above case to succeed, and it's quite small anyway. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copied text from cover letter with more details about the problem being solved ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Liu Bo	0d1e0bead6	Btrfs: report errors when checksum is not found When btrfs fails the checksum check, it'll fill the whole page with "1". However, if %csum_expected is 0 (which means there is no checksum), then for some unknown reason, we just pretend that the read is correct, so userspace would be confused about the dilemma that read is successful but getting a page with all content being "1". This can happen due to a bug in btrfs-convert. This fixes it by always returning errors if checksum doesn't match. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	69f03f137a	btrfs: Prevent possible ERR_PTR() dereference In btrfs_full_stripe_len/btrfs_is_parity_mirror we have similar code which gets the chunk map for a particular range via get_chunk_map. However, get_chunk_map can return an ERR_PTR value and while the 2 callers do catch this with a WARN_ON they then proceed to indiscriminately dereference the extent map. This of course leads to a crash. Fix the offenders by making the dereference conditional on IS_ERR. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	1174cade81	btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemand Many commits ago the data space_info in alloc_data_chunk_ondemand used to be acquired from the inode. At that point commit `33b4d47f5e` ("Btrfs: deal with NULL space info") got introduced to deal with spurios cases where the space info could be null, following a rebalance. Nowadays, however, the space info is referenced directly from the btrfs_fs_info struct which is initialised at filesystem mount time. This makes the null checks redundant, so remove them. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	7bdd6277e0	btrfs: Remove redundant argument of flush_space All callers of flush_space pass the same number for orig/num_bytes arguments. Let's remove one of the numbers and also modify the trace point to show only a single number - bytes requested. Seems that last point where the two parameters were treated differently is before the ticketed enospc rework. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Aleksa Sarai	6c6b5a39c4	btrfs: resume qgroup rescan on rw remount Several distributions mount the "proper root" as ro during initrd and then remount it as rw before pivot_root(2). Thus, if a rescan had been aborted by a previous shutdown, the rescan would never be resumed. This issue would manifest itself as several btrfs ioctl(2)s causing the entire machine to hang when btrfs_qgroup_wait_for_completion was hit (due to the fs_info->qgroup_rescan_running flag being set but the rescan itself not being resumed). Notably, Docker's btrfs storage driver makes regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT (causing this problem to be manifested on boot for some machines). Cc: <stable@vger.kernel.org> # v3.11+ Cc: Jeff Mahoney <jeffm@suse.com> Fixes: `b382a324b6` ("Btrfs: fix qgroup rescan resume on mount") Signed-off-by: Aleksa Sarai <asarai@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Edmund Nadolski	01747e92a9	btrfs: clean up extraneous computations in add_delayed_refs Repeating the same computation in multiple places is not necessary. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Edmund Nadolski	3ec4d3238a	btrfs: allow backref search checks for shared extents When called with a struct share_check, find_parent_nodes() will detect a shared extent and immediately return with BACKREF_SHARED_FOUND. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Edmund Nadolski	9dd14fd696	btrfs: add cond_resched() calls when resolving backrefs Since backref resolution is CPU-intensive, the cond_resched calls should help alleviate soft lockup occurences. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Jeff Mahoney	00142756e1	btrfs: backref, add tracepoints for prelim_ref insertion and merging This patch adds a tracepoint event for prelim_ref insertion and merging. For each, the ref being inserted or merged and the count of tree nodes is issued. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Jeff Mahoney	6c336b212b	btrfs: add a node counter to each of the rbtrees This patch adds counters to each of the rbtrees so that we can tell how large they are growing for a given workload. These counters will be exported by tracepoints in the next patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Edmund Nadolski	86d5f99442	btrfs: convert prelimary reference tracking to use rbtrees It's been known for a while that the use of multiple lists that are periodically merged was an algorithmic problem within btrfs. There are several workloads that don't complete in any reasonable amount of time (e.g. btrfs/130) and others that cause soft lockups. The solution is to use a set of rbtrees that do insertion merging for both indirect and direct refs, with the former converting refs into the latter. The result is a btrfs/130 workload that used to take several hours now takes about half of that. This runtime still isn't acceptable and a future patch will address that by moving the rbtrees higher in the stack so the lookups can be shared across multiple calls to find_parent_nodes. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:11:55 +02:00
Colin Ian King	65f2b26339	reiserfs: fix spelling mistake: "tranasction" -> "transaction" Trivial fix to spelling mistake in reiserfs_warning message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-16 15:59:47 +02:00
Edmund Nadolski	f6954245d9	btrfs: remove ref_tree implementation from backref.c Commit `afce772e87` ("btrfs: fix check_shared for fiemap ioctl") added the ref_tree code in backref.c to reduce backref searching for shared extents under the FIEMAP ioctl. This code will not be compatible with the upcoming rbtree changes for improved backref searching, so this patch removes the ref_tree code. The rbtree changes will provide the equivalent functionality for FIEMAP. The above commit also introduced transaction semantics around calls to btrfs_check_shared() in order to accurately account for delayed refs. This functionality needs to be retained, so a complete revert of the above commit is not desirable. This patch therefore removes the ref_tree portion of the commit as above, however it does not remove the transaction portion. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Edmund Nadolski	bb739cf08e	btrfs: btrfs_check_shared should manage its own transaction Commit `afce772e87` ("btrfs: fix check_shared for fiemap ioctl") added transaction semantics around calls to btrfs_check_shared() in order to provide accurate accounting of delayed refs. The transaction management should be done inside btrfs_check_shared(), so that callers do not need to manage transactions individually. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Jeff Mahoney	e0c476b128	btrfs: backref, cleanup __ namespace abuse We typically use __ to indicate a helper routine that shouldn't be called directly without understanding the proper context required to do so. We use static functions to indicate that a function is private to a particular C file. The backref code uses static function and __ prefixes on nearly everything, which makes the code difficult to read and establishes a pattern for future code that shouldn't be followed. This patch drops all the unnecessary prefixes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Jeff Mahoney	4dae077a83	btrfs: backref, add unode_aux_to_inode_list helper Replacing the double cast and ternary conditional with a helper makes the code easier on the eyes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Jeff Mahoney	73980becae	btrfs: backref, constify some arguments This constifies a few buffers used in the backref code. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Jeff Mahoney	9a35b63728	btrfs: constify tracepoint arguments Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Jeff Mahoney	1cbb1f454e	btrfs: struct-funcs, constify readers We have reader helpers for most of the on-disk structures that use an extent_buffer and pointer as offset into the buffer that are read-only. We should mark them as const and, in turn, allow consumers of these interfaces to mark the buffers const as well. No impact on code, but serves as documentation that a buffer is intended not to be modified. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Nikolay Borisov	23d1f73788	btrfs: remove unused sectorsize member The sectorsize member of btrfs_block_group_cache is unused. So remove it, this reduces the number of holes in the struct. With patch: /* size: 856, cachelines: 14, members: 40 / / sum members: 837, holes: 4, sum holes: 19 / / bit holes: 1, sum bit holes: 29 bits / / last cacheline: 24 bytes / Without patch: / size: 864, cachelines: 14, members: 41 / / sum members: 841, holes: 5, sum holes: 23 / / bit holes: 1, sum bit holes: 29 bits / / last cacheline: 32 bytes */ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Nikolay Borisov	f148ef4d3a	btrfs: Be explicit about usage of min() __btrfs_alloc_chunk contains code which boils down to: ndevs = min(ndevs, devs_max) It's conditional upon devs_max not being 0. However, it cannot really be 0 since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use min explicitly. This has no functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:52 +02:00
Nikolay Borisov	e5600fd6fc	btrfs: Use explicit round_down call rather than open-coding it No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:52 +02:00
Nikolay Borisov	ebcc9301ea	btrfs: convert while loop to list_for_each_entry No functional changes, just make the loop a bit more readable Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:52 +02:00
David S. Miller	463910e2df	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-15 20:23:23 -07:00
Chao Yu	b8c502b81e	f2fs: fix potential overflow when adjusting GC cycle While comparing signed and unsigned variables, compiler will converts the signed value to unsigned one, due to this reason, {in,de}crease_sleep_time may return overflowed result. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-15 10:40:14 -07:00
Chao Yu	9a20d391cd	f2fs: avoid unneeded sync on quota file We only need to sync quota file with appointed quota type instead of all types in f2fs_quota_{on,off}. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-15 10:40:13 -07:00
Jaegeuk Kim	d9872a698c	f2fs: introduce gc_urgent mode for background GC This patch adds a sysfs entry to control urgent mode for background GC. If this is set, background GC thread conducts GC with gc_urgent_sleep_time all the time. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-15 10:40:12 -07:00
Jaegeuk Kim	3537581a72	f2fs: use IPU for cold files We expect cold files write data sequentially, but sometimes some of small data can be updated, which incurs fragmentation. Let's avoid that. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-15 10:40:11 -07:00
Yunlong Song	008396e1b0	f2fs: fix the size value in __check_sit_bitmap The current size value is not correct and will miss bitmap check. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-15 10:40:10 -07:00
Thomas Tai	cc1dfa8b75	gfs2: fix slab corruption during mounting and umounting gfs file system When using cman-3.0.12.1 and gfs2-utils-3.0.12.1, mounting and unmounting GFS2 file system would cause kernel to hang. The slab allocator suggests that it is likely a double free memory corruption. The issue is traced back to v3.9-rc6 where a patch is submitted to use kzalloc() for storing a bitmap instead of using a local variable. The intention is to allocate memory during mount and to free memory during unmount. The original patch misses a code path which has already freed the memory and caused memory corruption. This patch sets the memory pointer to NULL after the memory is freed, so that double free memory corruption will not happen. gdlm_mount() '-- set_recover_size() which use kzalloc() '-- if dlm does not support ops callbacks then '--- free_recover_size() which use kfree() gldm_unmount() '-- free_recover_size() which use kfree() Previous patch which introduced the double free issue is commit `57c7310b8e` ("GFS2: use kmalloc for lvb bitmap") Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>	2017-08-15 11:54:09 -05:00
Trond Myklebust	2ce209c42c	NFS: Wait for requests that are locked on the commit list If a request is on the commit list, but is locked, we will currently skip it, which can lead to livelocking when the commit count doesn't reduce to zero. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:48 -04:00
Trond Myklebust	8205b9ce03	NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg() Now that we no longer hold the inode->i_lock when manipulating the commit lists, it is safe to call pnfs_put_lseg() again. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:48 -04:00
Trond Myklebust	4b9bb25b36	NFS: Switch to using mapping->private_lock for page writeback lookups. Switch from using the inode->i_lock for this to avoid contention with other metadata manipulation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:48 -04:00
Trond Myklebust	5cb953d4b1	NFS: Use an atomic_long_t to count the number of commits Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:48 -04:00
Trond Myklebust	a6b6d5b85a	NFS: Use an atomic_long_t to count the number of requests Rather than forcing us to take the inode->i_lock just in order to bump the number. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	e824f99ada	NFSv4: Use a mutex to protect the per-inode commit lists The commit lists can get very large, so using the inode->i_lock can end up affecting general metadata performance. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	b30d2f04c3	NFS: Refactor nfs_page_find_head_request() Split out the 2 cases so that we can treat the locking differently. The issue is that the locking in the pageswapcache cache is highly linked to the commit list locking. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	bd37d6fce1	NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request() Hide the locking from nfs_lock_and_join_requests() so that we can separate out the requirements for swapcache pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	7e8a30f8b4	NFS: Fix up nfs_page_group_covers_page() Fix up the test in nfs_page_group_covers_page(). The simplest implementation is to check that we have a set of intersecting or contiguous subrequests that connect page offset 0 to nfs_page_length(req->wb_page). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	1344b7ea17	NFS: Remove unused parameter from nfs_page_group_lock() nfs_page_group_lock() is now always called with the 'nonblock' parameter set to 'false'. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	dee83046e7	NFS: Remove unuse function nfs_page_group_lock_wait() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	902a4c0046	NFS: Remove nfs_page_group_clear_bits() At this point, we only expect ever to potentially see PG_REMOVE and PG_TEARDOWN being set on the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	5b2b5187fa	NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases Since nfs_page_group_destroy() does not take any locks on the requests to be freed, we need to ensure that we don't inadvertently free the request in nfs_destroy_unlinked_subrequests() while the last reference is being released elsewhere. Do this by: 1) Taking a reference to the request unless it is already being freed 2) Checking (under the page group lock) if PG_TEARDOWN is already set before freeing an unreferenced request in nfs_destroy_unlinked_subrequests() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	74a6d4b5ae	NFS: Further optimise nfs_lock_and_join_requests() When locking the entire group in order to remove subrequests, the locks are always taken in order, and with the page group lock being taken after the page head is locked. The intention is that: 1) The lock on the group head guarantees that requests may not be removed from the group (although new entries could be appended if we're not holding the group lock). 2) It is safe to drop and retake the page group lock while iterating through the list, in particular when waiting for a subrequest lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	b5bab9bf91	NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests() We should no longer need the inode->i_lock, now that we've straightened out the request locking. The locking schema is now: 1) Lock page head request 2) Lock the page group 3) Lock the subrequests one by one Note that there is a subtle race with nfs_inode_remove_request() due to the fact that the latter does not lock the page head, when removing it from the struct page. Only the last subrequest is locked, hence we need to re-check that the PagePrivate(page) is still set after we've locked all the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	7e6cca6caf	NFS: Remove page group limit in nfs_flush_incompatible() nfs_try_to_update_request() should be able to cope now. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	f6032f216f	NFS: Teach nfs_try_to_update_request() to deal with request page_groups Simplify the code, and avoid some flushes to disk. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	b66aaa8dfe	NFS: Fix the inode request accounting when pages have subrequests Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests() manipulate the inode flags adjusting the NFS_I(inode)->nrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:47 -04:00
Trond Myklebust	31a01f093e	NFS: Don't unlock writebacks before declaring PG_WB_END We don't want nfs_lock_and_join_requests() to start fiddling with the request before the call to nfs_page_group_sync_on_bit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	e14bebf6de	NFS: Don't check request offset and size without holding a lock Request offsets and sizes are not guaranteed to be stable unless you are holding the request locked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	a0e265bc78	NFS: Fix an ABBA issue in nfs_lock_and_join_requests() All other callers of nfs_page_group_lock() appear to already hold the page lock on the head page, so doing it in the opposite order here is inefficient, although not deadlock prone since we roll back all locks on contention. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	7cb9cd9aa2	NFS: Fix a reference and lock leak in nfs_lock_and_join_requests() Yes, this is a situation that should never happen (hence the WARN_ON) but we should still ensure that we free up the locks and references to the faulty pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	08fead2ae5	NFS: Ensure we always dereference the page head last This fixes a race with nfs_page_group_sync_on_bit() whereby the call to wake_up_bit() in nfs_page_group_unlock() could occur after the page header had been freed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	1403390d83	NFS: Reduce lock contention in nfs_try_to_update_request() Micro-optimisation to move the lockless check into the for(;;) loop. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	82749dd4ef	NFS: Reduce lock contention in nfs_page_find_head_request() Add a lockless check for whether or not the page might be carrying an existing writeback before we grab the inode->i_lock. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	6d17d653c9	NFS: Simplify page writeback We don't expect the page header lock to ever be held across I/O, so it should always be safe to wait for it, even if we're doing nonblocking writebacks. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-15 11:54:46 -04:00
Trond Myklebust	55cfcd1211	Merge branch 'open_state'	2017-08-15 11:54:13 -04:00
Greg Kroah-Hartman	d985524680	Merge 4.13-rc5 into char-misc-next We want the firmware, and other changes, in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-08-14 13:29:31 -07:00
Tahsin Erdogan	32aaf19420	ext4: add missing xattr hash update When updating an extended attribute, if the padded value sizes are the same, a shortcut is taken to avoid the bulk of the work. This was fine until the xattr hash update was moved inside ext4_xattr_set_entry(). With that change, the hash update got missed in the shortcut case. Thanks to ZhangYi (yizhang089@gmail.com) for root causing the problem. Fixes: `daf8328172` ("ext4: eliminate xattr entry e_hash recalculation for removes") Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-14 08:30:06 -04:00
Theodore Ts'o	b80b32b6d5	ext4: fix clang build regression Arnd Bergmann <arnd@arndb.de> As Stefan pointed out, I misremembered what clang can do specifically, and it turns out that the variable-length array at the end of the structure did not work (a flexible array would have worked here but not solved the problem): fs/ext4/mballoc.c:2303:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported ext4_grpblk_t counters[blocksize_bits + 2]; This reverts part of my previous patch, using a fixed-size array again, but keeping the check for the array overflow. Fixes: `2df2c3402f` ("ext4: fix warning about stack corruption") Reported-by: Stefan Agner <stefan@agner.ch> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-14 08:29:18 -04:00
Trond Myklebust	75e8c48b9e	NFSv4: Use the nfs4_state being recovered in _nfs4_opendata_to_nfs4_state() If we're recovering a nfs4_state, then we should try to use that instead of looking up a new stateid. Only do that if the inodes match, though. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-13 20:36:15 -04:00
Trond Myklebust	4e2fcac773	NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() When doing open by filehandle we don't really want to lookup a new inode, but rather update the one we've got. Add a helper which does this for us. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-13 20:36:15 -04:00
Boris Brezillon	d4092d76a4	mtd: nand: Rename nand.h into rawnand.h We are planning to share more code between different NAND based devices (SPI NAND, OneNAND and raw NANDs), but before doing that we need to move the existing include/linux/mtd/nand.h file into include/linux/mtd/rawnand.h so we can later create a nand.h header containing all common structure and function prototypes. Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Peter Pan <peterpandong@micron.com> Acked-by: Vladimir Zapolskiy <vz@mleia.com> Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Acked-by: Wenyou Yang <wenyou.yang@microchip.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Han Xu <han.xu@nxp.com> Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com> Acked-by: Shawn Guo <shawnguo@kernel.org> Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-By: Harvey Hunt <harveyhuntnexus@gmail.com> Acked-by: Tony Lindgren <tony@atomide.com> Acked-by: Krzysztof Halasa <khalasa@piap.pl>	2017-08-13 10:11:49 +02:00
Christoph Hellwig	e28ae8e428	iomap: fix integer truncation issues in the zeroing and dirtying helpers Fix the min_t calls in the zeroing and dirtying helpers to perform the comparisms on 64-bit types, which prevents them from incorrectly being truncated, and larger zeroing operations being stuck in a never ending loop. Special thanks to Markus Stockhausen for spotting the bug. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-08-11 16:56:33 -07:00
Omar Sandoval	c44245b3d5	xfs: fix inobt inode allocation search optimization When we try to allocate a free inode by searching the inobt, we try to find the inode nearest the parent inode by searching chunks both left and right of the chunk containing the parent. As an optimization, we cache the leftmost and rightmost records that we previously searched; if we do another allocation with the same parent inode, we'll pick up the search where it last left off. There's a bug in the case where we found a free inode to the left of the parent's chunk: we need to update the cached left and right records, but because we already reassigned the right record to point to the left, we end up assigning the left record to both the cached left and right records. This isn't a correctness problem strictly, but it can result in the next allocation rechecking chunks unnecessarily or allocating inodes further away from the parent than it needs to. Fix it by swapping the record pointer after we update the cached left and right records. Fixes: `bd16956599` ("xfs: speed up free inode search") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-08-11 16:56:33 -07:00
Linus Torvalds	216e4a1def	Some more NFS client bugfixes for 4.13 Stable fix: - Fix leaking nfs4_ff_ds_version array Other fixes: - Improve TEST_STATEID OLD_STATEID handling to prevent recovery loop - Require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile errors -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmOFIIACgkQ18tUv7Cl QOsaUQ/9E7lAP6yYp8HfjIBayN1gcme0ZeGzmWVdP8R9isvqTE0MjrwoNxk7h61H La/qUcymE32bMX8qYlDs0mw+yhiTcR/UoP5lS/4FCSUZoQsE6BWXoh+O9QlqEcuE mFbA9SV52Pf5Mdc/bTNKyh7jgCjeqzlu2sRo5LUM+N7G/M2a5RPfJVGVNYpOmVs/ ay30B5tHG/K3eeXECLjFTw3HeMorsS2coTaxtX6RghqPoVF6OFZarMUt69IX3zgg jBjokz7YfaPSeOEIOapGGRRARHRBAaPE8TvAtRd45R2pMk+Lr12cFWLjT72wRCCM nXrTpJc+q8feje9YpT5yoKtgRnW6etxKM8dtyYrXG1NO+dfZHNIe2Z1ARplhzhV3 Rt8lBV0N0b7kHZfyMJjYINhAbUxvS8UghRpljuHm4+f1lkoV6cVhKoaat/7MQDwZ I55M2Edl+A6wPQA7hpFuIT++PVN6GDK7D1rZTKaDBfZ3OCTOQLx0g1kZwHYs/lmk gvvtkj82RmbIPoG1rbxHTJFoQdVrpVCYAWr4rbgqNvUrZCjxTRmwRmyMpC/M1cXI noyZ/F+VdVLa0mADKMUmiQJ6QkoHjRIAIqlJbLRRl2VFlWHfu7hUiXk7hqt5ocQW cpxwird0Fur8cbEKVriRcwNpqGBrDDO7bv1lyQkwEOeHWZ6Fv9o= =1/Ms -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "A few more NFS client bugfixes from me for rc5. Dros has a stable fix for flexfiles to prevent leaking the nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a potential recovery loop situation with the TEST_STATEID operation, and Christoph fixed up the pNFS blocklayout Kconfig options to prevent unsafe use with kernels that don't have large block device support. Summary: Stable fix: - fix leaking nfs4_ff_ds_version array Other fixes: - improve TEST_STATEID OLD_STATEID handling to prevent recovery loop - require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile errors" * tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs: pnfs/blocklayout: require 64-bit sector_t NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid() nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays	2017-08-11 13:54:09 -07:00
Linus Torvalds	2bfc37cdef	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Fix a few bugs in fuse" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: set mapping error in writepage_locked when it fails fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio fuse: initialize the flock flag in fuse_file on allocation	2017-08-11 11:20:48 -07:00
Christoph Hellwig	8a9d6e964d	pnfs/blocklayout: require 64-bit sector_t The blocklayout code does not compile cleanly for a 32-bit sector_t, and also has no reliable checks for devices sizes, which makes it unsafe to use with a kernel that doesn't support large block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: `5c83746a0c` ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-08-11 14:10:13 -04:00
Ingo Molnar	040cca3ab2	Merge branch 'linus' into locking/core, to resolve conflicts Conflicts: include/linux/mm_types.h mm/huge_memory.c I removed the smp_mb__before_spinlock() like the following commit does: `8b1b436dd1` ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()") and fixed up the affected commits. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-11 13:51:59 +02:00
Jeff Layton	9183976ef1	fuse: set mapping error in writepage_locked when it fails This ensures that we see errors on fsync when writeback fails. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-08-11 11:38:26 +02:00
Mike Rapoport	e86b298beb	userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage When the process exit races with outstanding mcopy_atomic, it would be better to return ESRCH error. When such race occurs the process and it's mm are going away and returning "no such process" to the uffd monitor seems better fit than ENOSPC. Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-10 15:54:07 -07:00
Minchan Kim	b3a81d0841	mm: fix KSM data corruption Nadav reported KSM can corrupt the user data by the TLB batching race[1]. That means data user written can be lost. Quote from Nadav Amit: "For this race we need 4 CPUs: CPU0: Caches a writable and dirty PTE entry, and uses the stale value for write later. CPU1: Runs madvise_free on the range that includes the PTE. It would clear the dirty-bit. It batches TLB flushes. CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We care about the fact that it clears the PTE write-bit, and of course, batches TLB flushes. CPU3: Runs KSM. Our purpose is to pass the following test in write_protect_page(): if (pte_write(pvmw.pte) \|\| pte_dirty(pvmw.pte) \|\| (pte_protnone(pvmw.pte) && pte_savedwrite(pvmw.pte))) Since it will avoid TLB flush. And we want to do it while the PTE is stale. Later, and before replacing the page, we would be able to change the page. Note that all the operations the CPU1-3 perform canhappen in parallel since they only acquire mmap_sem for read. We start with two identical pages. Everything below regards the same page/PTE. CPU0 CPU1 CPU2 CPU3 ---- ---- ---- ---- Write the same value on page [cache PTE as dirty in TLB] MADV_FREE pte_mkclean() 4 > clear_refs pte_wrprotect() write_protect_page() [ success, no flush ] pages_indentical() [ ok ] Write to page different value [Ok, using stale PTE] replace_page() Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0 already wrote on the page, but KSM ignored this write, and it got lost" In above scenario, MADV_FREE is fixed by changing TLB batching API including [set\|clear]_tlb_flush_pending. Remained thing is soft-dirty part. This patch changes soft-dirty uses TLB batching API instead of flush_tlb_mm and KSM checks pending TLB flush by using mm_tlb_flush_pending so that it will flush TLB to avoid data lost if there are other parallel threads pending TLB flush. [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: Nadav Amit <namit@vmware.com> Tested-by: Nadav Amit <namit@vmware.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-10 15:54:07 -07:00
Johannes Weiner	d507e2ebd2	mm: fix global NR_SLAB_.*CLAIMABLE counter reads As Tetsuo points out: "Commit `385386cff4` ("mm: vmstat: move slab statistics from zone to node counters") broke "Slab:" field of /proc/meminfo . It shows nearly 0kB" In addition to /proc/meminfo, this problem also affects the slab counters OOM/allocation failure info dumps, can cause early -ENOMEM from overcommit protection, and miscalculate image size requirements during suspend-to-disk. This is because the patch in question switched the slab counters from the zone level to the node level, but forgot to update the global accessor functions to read the aggregate node data instead of the aggregate zone data. Use global_node_page_state() to access the global slab counters. Fixes: `385386cff4` ("mm: vmstat: move slab statistics from zone to node counters") Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Stefan Agner <stefan@agner.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-10 15:54:06 -07:00
Abhi Das	b066a4eebd	gfs2: forcibly flush ail to relieve memory pressure On systems with low memory, it is possible for gfs2 to infinitely loop in balance_dirty_pages() under heavy IO (creating sparse files). balance_dirty_pages() attempts to write out the dirty pages via gfs2_writepages() but none are found because these dirty pages are being used by the journaling code in the ail. Normally, the journal has an upper threshold which when hit triggers an automatic flush of the ail. But this threshold can be higher than the number of allowable dirty pages and result in the ail never being flushed. This patch forces an ail flush when gfs2_writepages() fails to write anything. This is a good indication that the ail might be holding some dirty pages. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:51:03 -05:00
Andreas Gruenbacher	a91323e255	gfs2: Clean up waiting on glocks The prepare_to_wait_on_glock and finish_wait_on_glock functions introduced in commit 56a365be "gfs2: gfs2_glock_get: Wait on freeing glocks" are better removed, resulting in cleaner code. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:51:02 -05:00
Andreas Gruenbacher	6a1c8f6dcf	gfs2: Defer deleting inodes under memory pressure When under memory pressure and an inode's link count has dropped to zero, defer deleting the inode to the delete workqueue. This avoids calling into DLM under memory pressure, which can deadlock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:49:13 -05:00
Andreas Gruenbacher	71c1b21368	gfs2: gfs2_evict_inode: Put glocks asynchronously gfs2_evict_inode is called to free inodes under memory pressure. The function calls into DLM when an inode's last cluster-wide reference goes away (remote unlink) and to release the glock and associated DLM lock before finally destroying the inode. However, if DLM is blocked on memory to become available, calling into DLM again will deadlock. Avoid that by decoupling releasing glocks from destroying inodes in that case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously in work queue context, when the associated inodes have likely already been destroyed. With this change, inodes can end up being unlinked, remote-unlink can be triggered, and then the inode can be reallocated before all remote-unlink callbacks are processed. To detect that, revalidate the link count in gfs2_evict_inode to make sure we're not deleting an allocated, referenced inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:45:21 -05:00
Andreas Gruenbacher	eebd2e813f	gfs2: Get rid of gfs2_set_nlink Remove gfs2_set_nlink which prevents the link count of an inode from becoming non-zero once it has reached zero. The next commit reduces the amount of waiting on glocks when an inode is evicted from memory. With that, an inode can become reallocated before all the remote-unlink callbacks from a previous delete are processed, which causes the link count to change from zero to non-zero. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:42:11 -05:00
Andreas Gruenbacher	0515480ad4	gfs2: gfs2_glock_get: Wait on freeing glocks Keep glocks in their hash table until they are freed instead of removing them when their last reference is dropped. This allows to wait for any previous instances of a glock to go away in gfs2_glock_get before creating a new glocks. Special thanks to Andy Price for finding and fixing a problem which also required us to delete the rcu_read_unlock from the error case in function gfs2_glock_get. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-10 10:39:31 -05:00
Peter Zijlstra	a9668cd6ee	locking: Remove smp_mb__before_spinlock() Now that there are no users of smp_mb__before_spinlock() left, remove it entirely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:29:03 +02:00
Peter Zijlstra	ff7a5fb0f1	overlayfs, locking: Remove smp_mb__before_spinlock() usage While we could replace the smp_mb__before_spinlock() with the new smp_mb__after_spinlock(), the normal pattern is to use smp_store_release() to publish an object that is used for lockless_dereference() -- and mirrors the regular rcu_assign_pointer() / rcu_dereference() patterns. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:29:02 +02:00
Aleksa Sarai	74dc3384fc	sched/debug: Use task_pid_nr_ns in /proc/$pid/sched It appears as though the addition of the PID namespace did not update the output code for /proc//sched, which resulted in it providing PIDs that were not self-consistent with the /proc mount. This additionally made it trivial to detect whether a process was inside &init_pid_ns from userspace, making container detection trivial: https://github.com/jessfraz/amicontained This leads to situations such as: % unshare -pmf % mount -t proc proc /proc % head -n1 /proc/1/sched head (10047, #threads: 1) Fix this by just using task_pid_nr_ns for the output of /proc//sched. All of the other uses of task_pid_nr in kernel/sched/debug.c are from a sysctl context and thus don't need to be namespaced. Signed-off-by: Aleksa Sarai <asarai@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Jess Frazelle <acidburn@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cyphar@cyphar.com Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:18:19 +02:00
Chao Yu	b0af6d491a	f2fs: add app/fs io stat This patch enables inner app/fs io stats and introduces below virtual fs nodes for exposing stats info: /sys/fs/f2fs/<dev>/iostat_enable /proc/fs/f2fs/<dev>/iostat_info Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix wrong stat assignment] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-09 21:43:58 -07:00
Yunlong Song	35ee82ca13	f2fs: do not change the valid_block value if cur_valid_map was wrongly set or cleared Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-09 17:45:23 -07:00
Yunlong Song	6415fedc57	f2fs: update cur_valid_map_mir together with cur_valid_map When cur_valid_map passes the f2fs_test_and_set(,clear)_bit test, cur_valid_map_mir update is skipped unlikely, so fix it. The fix now changes the mirror check together with cur_valid_map all the time. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: Fix unused variable and add unlikely for corner condition.] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-09 17:45:21 -07:00
David S. Miller	3118e6e19d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The UDP offload conflict is dealt with by simply taking what is in net-next where we have removed all of the UFO handling code entirely. The TCP conflict was a case of local variables in a function being removed from both net and net-next. In netvsc we had an assignment right next to where a missing set of u64 stats sync object inits were added. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:28:45 -07:00
Trond Myklebust	c0ca0e5934	NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid() If the call to TEST_STATEID returns NFS4ERR_OLD_STATEID, then it just means we raced with other calls to OPEN. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-08-09 13:36:56 -04:00
Andreas Gruenbacher	61b91cfdc6	gfs2: Fix trivial typos Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-09 09:36:39 -05:00
Bob Peterson	b2fb7dab7f	GFS2: Delete debugfs files only after we evict the glocks This patch moves the call to gfs2_delete_debugfs_file so that it comes after the glock hash table has been cleared. This way we can query the debugfs files if umount hangs. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-09 09:36:39 -05:00
Bob Peterson	645ebd49f0	GFS2: Don't waste time locking lru_lock for non-lru glocks Before this patch, glock_dq would call gfs2_glock_remove_from_lru. For glocks that are never put on the LRU, such as the transaction glock, this just takes the spin_lock, determines there's nothing to be done because the list is empty, then unlocks again. This was causing unnecessary lock contention on the lru_lock spin_lock. This patch adds a check for GLOF_LRU in the glops before taking the spin_lock. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-09 09:36:39 -05:00
Bob Peterson	2d821a8b71	GFS2: Don't bother trying to add rgrps to the lru list This patch removes a call to gfs2_glock_add_to_lru from function gfs2_clear_rgrpd. The call is just a waste of time because as soon as it adds it to the lru_list, the call to gfs2_glock_put takes it back off again. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-08-09 09:36:38 -05:00
Bob Peterson	240c6235df	GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode This patch adds some calls to clear gl_object in function gfs2_delete_inode. Since we are deleting the inode, and the glock typically outlives the inode in core, we must clear gl_object so subsequent use of the glock (e.g. for a new inode in its place) will not have the old pointer sitting there. In error cases we need to tidy up after ourselves. In non-error cases, we need to clear gl_object before we set the block free in the bitmap so residules aren't left for potential inode creators. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2017-08-09 09:36:38 -05:00
Bob Peterson	9c1b28081f	GFS2: Clear gl_object if gfs2_create_inode fails If function gfs2_create_inode fails after the inode has been created (for example, if the inode_refresh fails for some reason) the function was setting gl_object but never clearing it again. The glocks are left pointing to a freed inode. This patch adds the calls to clear gl_object in the appropriate error paths. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2017-08-09 09:36:26 -05:00
Weston Andros Adamson	1feb26162b	nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays The client was freeing the nfs4_ff_layout_ds, but not the contained nfs4_ff_ds_version array. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-08-08 17:18:10 -04:00
Linus Torvalds	1742c0f055	Changes since last update: - Fix memory leak when issuing discard - Fix propagation of the dax inode flag -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCgAGBQJZhNzbAAoJEPh/dxk0SrTrhMQP/jskrkmob2pHDV/C3jEkLI5g 2tcM9iS1AF3eWjdJtyIsTyejqaJONwLKjKC/pFA+zJtmv4hbC1DnVFy+3F1iU3Ws /BC4PzOnhdZrzbY0fjvg4M9sJOOfEPJbUm0eQyYRlUW3s+uRBhylz0/soa6JTA4G ZbxW9EhToJrHmT7T8oXXU9HVFLvJzhXdu+hbIGOiraTMcDkkEBGoW4Zz4dcRvjMU TZEt6WlBISKrCaGbtb38ChoMv97LGOQLbDM9oy4evvfnuJUQJJT/ayUZH6nvC/3d e9Lko4mPNLmTwfVh7hR4b8nC2TwAPPEcrvQcrKfgDolnzNJJU7en1TLJxJbWKEnM dvxixDp18E4lzSjVCC9pfCCY3esGLNKtmT5m9aCyNRl7oIdAhHxbIZABUquSurTn ii9Ulz+sRWZjY/X4/y+2tyEHLgGaJhDyHqz3I+1iBA2FBn2Wic/cZLvy/2ngmDWX rsVEj0ll8i9CLFGFgs6gjfe9dkmwVN+KA2VzgFuNFuNQlUyFZSq6Eqv7aKbgEDjM NzeKhkG2RMEBuHVZLHdeoJ2xNSD5Cuo6laJauevqFQ901rSAMqkUu6OjKHJQPKpt YMSgHVcnOJ0LaUcqNjJ+j1XlI7HLByu76s3uilvBnISUlLoRoUXwRBwi/BfCv0M0 MMgB+DAg66T4wPQfTh1y =4UKT -----END PGP SIGNATURE----- Merge tag 'xfs-4.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: "I have a couple more bug fixes for you today: - fix memory leak when issuing discard - fix propagation of the dax inode flag" * tag 'xfs-4.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Fix per-inode DAX flag inheritance xfs: Fix leak of discard bio	2017-08-07 18:16:22 -07:00
David S. Miller	fde6af4729	mlx5-shared-2017-08-07 This series includes some mlx5 updates for both net-next and rdma trees. From Saeed, Core driver updates to allow selectively building the driver with or without some large driver components, such as - E-Switch (Ethernet SRIOV support). - Multi-Physical Function Switch (MPFs) support. For that we split E-Switch and MPFs functionalities into separate files. From Erez, Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration is taking place and until it completes. From Rabie, Increase the maximum supported flow counters. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJZiDoAAAoJEEg/ir3gV/o+594H/RH5kRwC719s/5YQFJXvGsVC fjtj3UUJPLrWB8XBh7a4PRcxXPIHaFKJuY3MU7KHFIeZQFklJcit3njjpxDlUINo F5S1LHBSYBkeMD/ksWBA8OLCBprNGN6WQ2tuFfAjZlQQ44zqv8LJmegoDtW9bGRy aGAkjUmALEblQsq81y0BQwN2/8DA8HAywrs8L2dkH1LHwijoIeYMZFOtKugv1FbB ABSKxcU7D/NYw6rsVdZG59fHFQ+eKOspDFqBZrUzfQ+zUU2hFFo96ovfXBfIqYCV 7BtJuKXu2LeGPzFLsuw4h1131iqFT1iSMy9fEhf/4OwaL/KPP/+Umy8vP/XfM+U= =wCpd -----END PGP SIGNATURE----- Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-shared-2017-08-07 This series includes some mlx5 updates for both net-next and rdma trees. From Saeed, Core driver updates to allow selectively building the driver with or without some large driver components, such as - E-Switch (Ethernet SRIOV support). - Multi-Physical Function Switch (MPFs) support. For that we split E-Switch and MPFs functionalities into separate files. From Erez, Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration is taking place and until it completes. From Rabie, Increase the maximum supported flow counters. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-07 10:42:09 -07:00
Guoqing Jiang	1c24285372	dlm: use sock_create_lite inside tcp_accept_from_sock With commit `0ffdaf5b41` ("net/sock: add WARN_ON(parent->sk) in sock_graft()"), a calltrace happened as follows: [ 457.018340] WARNING: CPU: 0 PID: 15623 at ./include/net/sock.h:1703 inet_accept+0x135/0x140 ... [ 457.018381] RIP: 0010:inet_accept+0x135/0x140 [ 457.018381] RSP: 0018:ffffc90001727d18 EFLAGS: 00010286 [ 457.018383] RAX: 0000000000000001 RBX: ffff880012413000 RCX: 0000000000000001 [ 457.018384] RDX: 000000000000018a RSI: 00000000fffffe01 RDI: ffffffff8156fae8 [ 457.018384] RBP: ffffc90001727d38 R08: 0000000000000000 R09: 0000000000004305 [ 457.018385] R10: 0000000000000001 R11: 0000000000004304 R12: ffff880035ae7a00 [ 457.018386] R13: ffff88001282af10 R14: ffff880034e4e200 R15: 0000000000000000 [ 457.018387] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 [ 457.018388] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 457.018389] CR2: 00007fdec22f9000 CR3: 0000000002b5a000 CR4: 00000000000006f0 [ 457.018395] Call Trace: [ 457.018402] tcp_accept_from_sock.part.8+0x12d/0x449 [dlm] [ 457.018405] ? vprintk_emit+0x248/0x2d0 [ 457.018409] tcp_accept_from_sock+0x3f/0x50 [dlm] [ 457.018413] process_recv_sockets+0x3b/0x50 [dlm] [ 457.018415] process_one_work+0x138/0x370 [ 457.018417] worker_thread+0x4d/0x3b0 [ 457.018419] kthread+0x109/0x140 [ 457.018421] ? rescuer_thread+0x320/0x320 [ 457.018422] ? kthread_park+0x60/0x60 [ 457.018424] ret_from_fork+0x25/0x30 Since newsocket created by sock_create_kern sets it's sock by the path: sock_create_kern -> __sock_creat ->pf->create => inet_create -> sock_init_data Then WARN_ON is triggered by "con->sock->ops->accept => inet_accept -> sock_graft", it also means newsock->sk is leaked since sock_graft will replace it with a new sk. To resolve the issue, we need to use sock_create_lite instead of sock_create_kern, like commit `0933a578cd` ("rds: tcp: use sock_create_lite() to create the accept socket") did. Reported-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Edwin Török	55acdd926f	dlm: avoid double-free on error path in dlm_device_{register,unregister} Can be reproduced when running dlm_controld (tested on 4.4.x, 4.12.4): # seq 1 100 \| xargs -P0 -n1 dlm_tool join # seq 1 100 \| xargs -P0 -n1 dlm_tool leave misc_register fails due to duplicate sysfs entry, which causes dlm_device_register to free ls->ls_device.name. In dlm_device_deregister the name was freed again, causing memory corruption. According to the comment in dlm_device_deregister the name should've been set to NULL when registration fails, so this patch does that. sysfs: cannot create duplicate filename '/dev/char/10:1' ------------[ cut here ]------------ warning: cpu: 1 pid: 4450 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70 modules linked in: msr rfcomm dlm ccm bnep dm_crypt uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev btusb media btrtl btbcm btintel bluetooth ecdh_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_hdmi irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel thinkpad_acpi pcbc nvram snd_seq_midi snd_seq_midi_event aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_rawmidi aes_x86_64 crypto_simd glue_helper snd_hda_intel snd_hda_codec cryptd intel_cstate arc4 snd_hda_core snd_seq snd_seq_device snd_hwdep iwldvm intel_rapl_perf mac80211 joydev input_leds iwlwifi serio_raw cfg80211 snd_pcm shpchp snd_timer snd mac_hid mei_me lpc_ich mei soundcore sunrpc parport_pc ppdev lp parport autofs4 i915 psmouse e1000e ahci libahci i2c_algo_bit sdhci_pci ptp drm_kms_helper sdhci pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops drm wmi video cpu: 1 pid: 4450 comm: dlm_test.exe not tainted 4.12.4-041204-generic hardware name: lenovo 232425u/232425u, bios g2et82ww (2.02 ) 09/11/2012 task: ffff96b0cbabe140 task.stack: ffffb199027d0000 rip: 0010:sysfs_warn_dup+0x56/0x70 rsp: 0018:ffffb199027d3c58 eflags: 00010282 rax: 0000000000000038 rbx: ffff96b0e2c49158 rcx: 0000000000000006 rdx: 0000000000000000 rsi: 0000000000000086 rdi: ffff96b15e24dcc0 rbp: ffffb199027d3c70 r08: 0000000000000001 r09: 0000000000000721 r10: ffffb199027d3c00 r11: 0000000000000721 r12: ffffb199027d3cd1 r13: ffff96b1592088f0 r14: 0000000000000001 r15: ffffffffffffffef fs: 00007f78069c0700(0000) gs:ffff96b15e240000(0000) knlgs:0000000000000000 cs: 0010 ds: 0000 es: 0000 cr0: 0000000080050033 cr2: 000000178625ed28 cr3: 0000000091d3e000 cr4: 00000000001406e0 call trace: sysfs_do_create_link_sd.isra.2+0x9e/0xb0 sysfs_create_link+0x25/0x40 device_add+0x5a9/0x640 device_create_groups_vargs+0xe0/0xf0 device_create_with_groups+0x3f/0x60 ? snprintf+0x45/0x70 misc_register+0x140/0x180 device_write+0x6a8/0x790 [dlm] __vfs_write+0x37/0x160 ? apparmor_file_permission+0x1a/0x20 ? security_file_permission+0x3b/0xc0 vfs_write+0xb5/0x1a0 sys_write+0x55/0xc0 ? sys_fcntl+0x5d/0xb0 entry_syscall_64_fastpath+0x1e/0xa9 rip: 0033:0x7f78083454bd rsp: 002b:00007f78069bbd30 eflags: 00000293 orig_rax: 0000000000000001 rax: ffffffffffffffda rbx: 0000000000000006 rcx: 00007f78083454bd rdx: 000000000000009c rsi: 00007f78069bee00 rdi: 0000000000000005 rbp: 00007f77f8000a20 r08: 000000000000fcf0 r09: 0000000000000032 r10: 0000000000000024 r11: 0000000000000293 r12: 00007f78069bde00 r13: 00007f78069bee00 r14: 000000000000000a r15: 00007f78069bbd70 code: 85 c0 48 89 c3 74 12 b9 00 10 00 00 48 89 c2 31 f6 4c 89 ef e8 2c c8 ff ff 4c 89 e2 48 89 de 48 c7 c7 b0 8e 0c a8 e8 41 e8 ed ff <0f> ff 48 89 df e8 00 d5 f4 ff 5b 41 5c 41 5d 5d c3 66 0f 1f 84 ---[ end trace 40412246357cc9e0 ]--- dlm: 59f24629-ae39-44e2-9030-397ebc2eda26: leaving the lockspace group... bug: unable to handle kernel null pointer dereference at 0000000000000001 ip: [<ffffffff811a3b4a>] kmem_cache_alloc+0x7a/0x140 pgd 0 oops: 0000 [#1] smp modules linked in: dlm 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c iptable_filter dm_multipath crc32_pclmul dm_mod aesni_intel psmouse aes_x86_64 sg ablk_helper cryptd lrw gf128mul glue_helper i2c_piix4 nls_utf8 tpm_tis tpm isofs nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc xen_wdt ip_tables x_tables autofs4 hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic pata_acpi 8139too serio_raw ata_piix 8139cp mii uhci_hcd ehci_pci ehci_hcd libata scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 cpu: 0 pid: 394 comm: systemd-udevd tainted: g w 4.4.0+0 #1 hardware name: xen hvm domu, bios 4.7.2-2.2 05/11/2017 task: ffff880002410000 ti: ffff88000243c000 task.ti: ffff88000243c000 rip: e030:[<ffffffff811a3b4a>] [<ffffffff811a3b4a>] kmem_cache_alloc+0x7a/0x140 rsp: e02b:ffff88000243fd90 eflags: 00010202 rax: 0000000000000000 rbx: ffff8800029864d0 rcx: 000000000007b36c rdx: 000000000007b36b rsi: 00000000024000c0 rdi: ffff880036801c00 rbp: ffff88000243fdc0 r08: 0000000000018880 r09: 0000000000000054 r10: 000000000000004a r11: ffff880034ace6c0 r12: 00000000024000c0 r13: ffff880036801c00 r14: 0000000000000001 r15: ffffffff8118dcc2 fs: 00007f0ab77548c0(0000) gs:ffff880036e00000(0000) knlgs:0000000000000000 cs: e033 ds: 0000 es: 0000 cr0: 0000000080050033 cr2: 0000000000000001 cr3: 000000000332d000 cr4: 0000000000040660 stack: ffffffff8118dc90 ffff8800029864d0 0000000000000000 ffff88003430b0b0 ffff880034b78320 ffff88003430b0b0 ffff88000243fdf8 ffffffff8118dcc2 ffff8800349c6700 ffff8800029864d0 000000000000000b 00007f0ab7754b90 call trace: [<ffffffff8118dc90>] ? anon_vma_fork+0x60/0x140 [<ffffffff8118dcc2>] anon_vma_fork+0x92/0x140 [<ffffffff8107033e>] copy_process+0xcae/0x1a80 [<ffffffff8107128b>] _do_fork+0x8b/0x2d0 [<ffffffff81071579>] sys_clone+0x19/0x20 [<ffffffff815a30ae>] entry_syscall_64_fastpath+0x12/0x71 ] code: f6 75 1c 4c 89 fa 44 89 e6 4c 89 ef e8 a7 e4 00 00 41 f7 c4 00 80 00 00 49 89 c6 74 47 eb 32 49 63 45 20 48 8d 4a 01 4d 8b 45 00 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 ac 49 63 rip [<ffffffff811a3b4a>] kmem_cache_alloc+0x7a/0x140 rsp <ffff88000243fd90> cr2: 0000000000000001 --[ end trace 70cb9fd1b164a0e8 ]-- CC: stable@vger.kernel.org Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Bhumika Goyal	417f7c59ed	dlm: constify kset_uevent_ops structure Declare kset_uevent_ops structure as const as it is only passed as an argument to the function kset_create_and_add. This argument is of type const, so declare the structure as const. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Zhu Lingshan	3b0e761ba8	dlm: print log message when cluster name is not set Print a message when a cluster name is not specified by the caller. In this case the cluster name configured for the dlm is used without any validation that it is the cluster expected by the application. Signed-off-by: Zhu Lingshan <lszhu@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	2ab93ae138	dlm: Delete an unnecessary variable initialisation in dlm_ls_start() The local variable "rv" is reassigned by a statement at the beginning. Thus omit the explicit initialisation. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	d12ad1a964	dlm: Improve a size determination in two functions Replace the specification of two data structures by pointer dereferences as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	2f48e06102	dlm: Use kcalloc() in two functions * Multiplications for the size determination of memory allocations indicated that array data structures should be processed. Thus reuse the corresponding function "kcalloc". This issue was detected by using the Coccinelle software. * Replace the specification of data structures by pointer dereferences to make the corresponding size determinations a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	790854becc	dlm: Use kmalloc_array() in make_member_array() * A multiplication for the size determination of a memory allocation indicated that an array data structure should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. * Replace the specification of a data type by a pointer dereference to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	0d37eca752	dlm: Delete an error message for a failed memory allocation in dlm_recover_waiters_pre() Omit an extra message for a memory allocation failure in this function. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	102e67d4e3	dlm: Improve a size determination in dlm_recover_waiters_pre() Replace the specification of a data structure by a pointer dereference as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	fbb1008151	dlm: Use kcalloc() in dlm_scan_waiters() A multiplication for the size determination of a memory allocation indicated that an array data structure should be processed. Thus use the corresponding function "kcalloc". This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	2c257e96df	dlm: Improve a size determination in table_seq_start() Replace the specification of a data structure by a pointer dereference as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	41922ce831	dlm: Add spaces for better code readability The script "checkpatch.pl" pointed information out like the following. CHECK: spaces preferred around that '+' (ctx:VxV) Thus fix the affected source code places. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Markus Elfring	653996ca8d	dlm: Replace six seq_puts() calls by seq_putc() Six single characters (line breaks) should be put into a sequence. Thus use the corresponding function "seq_putc". This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Gang He	8e1743748b	dlm: Make dismatch error message more clear This change will try to make this error message more clear, since the upper applications (e.g. ocfs2) invoke dlm_new_lockspace to create a new lockspace with passing a cluster name. Sometimes, dlm_new_lockspace return failure while two cluster names dismatch, the user is a little confused since this line error message is not enough obvious. Signed-off-by: Gang He <ghe@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
Vlad Tsyrklevich	8286d6b14c	dlm: Fix kernel memory disclosure Clear the 'unused' field and the uninitialized padding in 'lksb' to avoid leaking memory to userland in copy_result_to_user(). Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net> Signed-off-by: David Teigland <teigland@redhat.com>	2017-08-07 11:23:09 -05:00
zhangyi (F)	41e327b586	quota: correct space limit check Currently we compare total space (curspace + rsvspace) with space limit in quota-tools when setting grace time and also in check_bdq(), but we missing rsvspace in somewhere else, correct them. This patch also fix incorrect zero dqb_btime and grace time updating failure when we use rsvspace(e.g. ext4 dalloc feature). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2017-08-07 16:51:28 +02:00
Trond Myklebust	bfab281721	NFSv4: Cleanup setting of the migration flags. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-07 09:32:55 -04:00
Trond Myklebust	937e3133cd	NFSv4.1: Ensure we clear the SP4_MACH_CRED flags in nfs4_sp4_select_mode() If the server changes, so that it no longer supports SP4_MACH_CRED, or that it doesn't support the same set of SP4_MACH_CRED functionality, then we want to ensure that we clear the unsupported flags. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-07 09:32:55 -04:00
Trond Myklebust	9c760d1fd5	NFSv4: Refactor _nfs4_proc_exchange_id() Tease apart the functionality in nfs4_exchange_id_done() so that it is easier to debug exchange id vs trunking issues by moving all the processing out of nfs4_exchange_id_done() and into the callers. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2017-08-07 09:32:55 -04:00
Linus Torvalds	ed66da1104	A large number of ext4 bug fixes and cleanups for v4.13 -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlmHbBAACgkQ8vlZVpUN gaMu3gf+LpI5bI1XA3R8KbXB2snnz6wM7OzArfqvreX+m+xP1CK6nVpAIgpkZqfw QkQ1xPJk7Q25vex/pPcsgLO0Vxf0i4vpydK+fYnf30S4WvGQVq6OHZWFFv2zM2YB 7TWxjG+KryM7j6JSXdUiSTKP3nX84TW/IMIWuZMR1nuOa8N5M4yD3uc+3EBTjSbq P/dxfmkp2hQKnlZVBWqCjJDhtxwUYTF4iZ/pbSVeGbgHCh1674ml+airb4K9ltNU 0vR0JChD12YJaafjaAyIrqqKwDGvnN+H5wyhCodEV9w8jthbcU04Jfmi1auB9UxT y7/sgbV64W2o5hBwxY3RXjZkVLpDsw== =Mtr7 -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A large number of ext4 bug fixes and cleanups for v4.13" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix copy paste error in ext4_swap_extents() ext4: fix overflow caused by missing cast in ext4_resize_fs() ext4, project: expand inode extra size if possible ext4: cleanup ext4_expand_extra_isize_ea() ext4: restructure ext4_expand_extra_isize ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize ext4: make xattr inode reads faster ext4: inplace xattr block update fails to deduplicate blocks ext4: remove unused mode parameter ext4: fix warning about stack corruption ext4: fix dir_nlink behaviour ext4: silence array overflow warning ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize ext4: release discard bio after sending discard commands ext4: convert swap_inode_data() over to use swap() on most of the fields ext4: error should be cleared if ea_inode isn't added to the cache ext4: Don't clear SGID when inheriting ACLs ext4: preserve i_mode if __ext4_set_acl() fails ext4: remove unused metadata accounting variables ext4: correct comment references to ext4_ext_direct_IO()	2017-08-06 12:31:17 -07:00
Maninder Singh	4e56201321	ext4: fix copy paste error in ext4_swap_extents() This bug was found by a static code checker tool for copy paste problems. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-06 01:33:07 -04:00
Jerry Lee	aec51758ce	ext4: fix overflow caused by missing cast in ext4_resize_fs() On a 32-bit platform, the value of n_blcoks_count may be wrong during the file system is resized to size larger than 2^32 blocks. This may caused the superblock being corrupted with zero blocks count. Fixes: `1c6bd7173d` Signed-off-by: Jerry Lee <jerrylee@qnap.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.7+	2017-08-06 01:18:31 -04:00
Miao Xie	c03b45b853	ext4, project: expand inode extra size if possible When upgrading from old format, try to set project id to old file first time, it will return EOVERFLOW, but if that file is dirtied(touch etc), changing project id will be allowed, this might be confusing for users, we could try to expand @i_extra_isize here too. Reported-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-06 01:00:49 -04:00
Miao Xie	b640b2c51b	ext4: cleanup ext4_expand_extra_isize_ea() Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Wang Shilong <wshilong@ddn.com>	2017-08-06 00:55:48 -04:00
Miao Xie	cf0a5e818f	ext4: restructure ext4_expand_extra_isize Current ext4_expand_extra_isize just tries to expand extra isize, if someone is holding xattr lock or some check fails, it will give up. So rename its name to ext4_try_to_expand_extra_isize. Besides that, we clean up unnecessary check and move some relative checks into it. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Wang Shilong <wshilong@ddn.com>	2017-08-06 00:40:01 -04:00
Miao Xie	3b10fdc6d8	ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize We should avoid the contention between the i_extra_isize update and the inline data insertion, so move the xattr trylock in front of i_extra_isize update. Signed-off-by: Miao Xie <miaoxie@huawei.com> Reviewed-by: Wang Shilong <wshilong@ddn.com>	2017-08-06 00:27:38 -04:00
Tahsin Erdogan	9699d4f91d	ext4: make xattr inode reads faster ext4_xattr_inode_read() currently reads each block sequentially while waiting for io operation to complete before moving on to the next block. This prevents request merging in block layer. Add a ext4_bread_batch() function that starts reads for all blocks then optionally waits for them to complete. A similar logic is used in ext4_find_entry(), so update that code to use the new function. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-06 00:07:01 -04:00
Tahsin Erdogan	ec00022030	ext4: inplace xattr block update fails to deduplicate blocks When an xattr block has a single reference, block is updated inplace and it is reinserted to the cache. Later, a cache lookup is performed to see whether an existing block has the same contents. This cache lookup will most of the time return the just inserted entry so deduplication is not achieved. Running the following test script will produce two xattr blocks which can be observed in "File ACL: " line of debugfs output: mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G mount /dev/sdb /mnt/sdb touch /mnt/sdb/{x,y} setfattr -n user.1 -v aaa /mnt/sdb/x setfattr -n user.2 -v bbb /mnt/sdb/x setfattr -n user.1 -v aaa /mnt/sdb/y setfattr -n user.2 -v bbb /mnt/sdb/y debugfs -R 'stat x' /dev/sdb \| cat debugfs -R 'stat y' /dev/sdb \| cat This patch defers the reinsertion to the cache so that we can locate other blocks with the same contents. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>	2017-08-05 22:41:42 -04:00
Tahsin Erdogan	77a2e84d51	ext4: remove unused mode parameter ext4_alloc_file_blocks() does not use its mode parameter. Remove it. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-05 22:15:45 -04:00
Arnd Bergmann	2df2c3402f	ext4: fix warning about stack corruption After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"), we get a warning about possible stack overflow from a memcpy that was not strictly bounded to the size of the local variable: inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2: include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=] We actually had a bug here that would have been found by the warning, but it was already fixed last year in commit `30a9d7afe7` ("ext4: fix stack memory corruption with 64k block size"). This replaces the fixed-length structure on the stack with a variable-length structure, using the correct upper bound that tells the compiler that everything is really fine here. I also change the loop count to check for the same upper bound for consistency, but the existing code is already correct here. Note that while clang won't allow certain kinds of variable-length arrays in structures, this particular instance is fine, as the array is at the end of the structure, and the size is strictly bounded. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-05 21:57:46 -04:00
Andreas Dilger	c741489206	ext4: fix dir_nlink behaviour The dir_nlink feature has been enabled by default for new ext4 filesystems since e2fsprogs-1.41 in 2008, and was automatically enabled by the kernel for older ext4 filesystems since the dir_nlink feature was added with ext4 in kernel 2.6.28+ when the subdirectory count exceeded EXT4_LINK_MAX-1. Automatically adding the file system features such as dir_nlink is generally frowned upon, since it could cause the file system to not be mountable on older kernel, thus preventing the administrator from rolling back to an older kernel if necessary. In this case, the administrator might also want to disable the feature because glibc's fts_read() function does not correctly optimize directory traversal for directories that use st_nlinks field of 1 to indicate that the number of links in the directory are not tracked by the file system, and could fail to traverse the full directory hierarchy. Fortunately, in the past ten years very few users have complained about incomplete file system traversal by glibc's fts_read(). This commit also changes ext4_inc_count() to allow i_nlinks to reach the full EXT4_LINK_MAX links on the parent directory (including "." and "..") before changing i_links_count to be 1. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405 Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-05 19:47:34 -04:00
Dan Carpenter	381cebfe72	ext4: silence array overflow warning I get a static checker warning: fs/ext4/ext4.h:3091 ext4_set_de_type() error: buffer overflow 'ext4_type_by_mode' 15 <= 15 It seems unlikely that we would hit this read overflow in real life, but it's also simple enough to make the array 16 bytes instead of 15. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-08-05 19:00:31 -04:00
Jan Kara	fcf5ea1099	ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize ext4_find_unwritten_pgoff() does not properly handle a situation when starting index is in the middle of a page and blocksize < pagesize. The following command shows the bug on filesystem with 1k blocksize: xfs_io -f -c "falloc 0 4k" \ -c "pwrite 1k 1k" \ -c "pwrite 3k 1k" \ -c "seek -a -r 0" foo In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048, SEEK_DATA) will return the correct result. Fix the problem by neglecting buffers in a page before starting offset. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> CC: stable@vger.kernel.org # 3.8+	2017-08-05 17:43:24 -04:00
Daeho Jeong	e45105772d	ext4: release discard bio after sending discard commands We've changed the discard command handling into parallel manner. But, in this change, I forgot decreasing the usage count of the bio which was used to send discard request. I'm sorry about that. Fixes: `a015434480` ("ext4: send parallel discards on commit completions") Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2017-08-05 13:11:57 -04:00
Lukas Czerner	56bdf855e6	xfs: Fix per-inode DAX flag inheritance According to the commit that implemented per-inode DAX flag: commit `58f88ca2df` ("xfs: introduce per-inode DAX enablement") the flag is supposed to act as "inherit flag". Currently this only works in the situations where parent directory already has a flag in di_flags set, otherwise inheritance does not work. This is because setting the XFS_DIFLAG2_DAX flag is done in a wrong branch designated for di_flags, not di_flags2. Fix this by moving the code to branch designated for setting di_flags2, which does test for flags in di_flags2. Fixes: `58f88ca2df` ("xfs: introduce per-inode DAX enablement") Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-08-04 13:43:36 -07:00
Jan Kara	ea7bd56fa3	xfs: Fix leak of discard bio The bio describing discard operation is allocated by __blkdev_issue_discard() which returns us a reference to it. That reference is never released and thus we leak this bio. Drop the bio reference once it completes in xlog_discard_endio(). CC: stable@vger.kernel.org Fixes: `4560e78f40` Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-08-04 13:43:36 -07:00
Jaegeuk Kim	a36c106dff	f2fs: use printk_ratelimited for f2fs_msg This patch reduces contention of printks. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-03 19:09:52 -07:00
Jaegeuk Kim	bf9e697ecd	f2fs: expose features to sysfs entry This patch exposes what features are supported by current f2fs build to sysfs entry via: /sys/fs/f2fs/features/ /sys/fs/f2fs/dev/features Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-03 19:09:51 -07:00
Chao Yu	704956ecf5	f2fs: support inode checksum This patch adds to support inode checksum in f2fs. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix verification flow] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-03 19:09:26 -07:00
Jaegeuk Kim	4f31d26b0c	f2fs: return wrong error number on f2fs_quota_write This must return size, not error number. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-03 19:05:05 -07:00
Yunlong Song	401db79f61	f2fs: provide f2fs_balance_fs to __write_node_page Let node writeback also do f2fs_balance_fs to ensure there are always enough free segments. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-08-03 19:05:04 -07:00
Linus Torvalds	995d03ae26	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "15 fixes" [ This does not merge the "fortify: use WARN instead of BUG for now" patch, which needs a bit of extra work to build cleanly with all configurations. Arnd is on it. - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: don't clear SGID when inheriting ACLs mm: allow page_cache_get_speculative in interrupt context userfaultfd: non-cooperative: flush event_wqh at release time ipc: add missing container_of()s for randstruct cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() userfaultfd_zeropage: return -ENOSPC in case mm has gone mm: take memory hotplug lock within numa_zonelist_order_handler() mm/page_io.c: fix oops during block io poll in swapin path zram: do not free pool->size_class kthread: fix documentation build warning kasan: avoid -Wmaybe-uninitialized warning userfaultfd: non-cooperative: notify about unmap of destination during mremap mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries pid: kill pidhash_size in pidhash_init() mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors	2017-08-03 14:58:13 -07:00
Jeff Layton	6d4b512413	ecryptfs: convert to file_write_and_wait in ->fsync This change is mainly for documentation/completeness, as ecryptfs never calls mapping_set_error, and so will never return a previous writeback error. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-08-03 14:20:22 -04:00
Ashish Samant	61c12b49e1	fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio Commit `8fba54aebb` ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes the ITER_BVEC page deadlock for direct io in fuse by checking in fuse_direct_io(), whether the page is a bvec page or not, before locking it. However, this check is missed when the "async_dio" mount option is enabled. In this case, set_page_dirty_lock() is called from the req->end callback in request_end(), when the fuse thread is returning from userspace to respond to the read request. This will cause the same deadlock because the bvec condition is not checked in this path. Here is the stack of the deadlocked thread, while returning from userspace: [13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds. [13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [13706.658788] glusterfs D ffffffff816c80f0 0 3006 1 0x00000080 [13706.658797] ffff8800d6713a58 0000000000000086 ffff8800d9ad7000 ffff8800d9ad5400 [13706.658799] ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0 7fffffffffffffff [13706.658801] 0000000000000002 ffffffff816c80f0 ffff8800d6713a78 ffffffff816c790e [13706.658803] Call Trace: [13706.658809] [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80 [13706.658811] [<ffffffff816c790e>] schedule+0x3e/0x90 [13706.658813] [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210 [13706.658816] [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0 [13706.658817] [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20 [13706.658819] [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10 [13706.658822] [<ffffffff810f5792>] ? ktime_get+0x52/0xc0 [13706.658824] [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110 [13706.658826] [<ffffffff816c8126>] bit_wait_io+0x36/0x50 [13706.658828] [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0 [13706.658831] [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse] [13706.658834] [<ffffffff8118800a>] __lock_page+0xaa/0xb0 [13706.658836] [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40 [13706.658838] [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60 [13706.658841] [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse] [13706.658844] [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse] [13706.658847] [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse] [13706.658849] [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse] [13706.658852] [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse] [13706.658854] [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse] [13706.658857] [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90 [13706.658859] [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90 [13706.658862] [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350 [fuse] [13706.658863] [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0 [13706.658866] [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210 [13706.658868] [<ffffffff81206361>] vfs_writev+0x41/0x50 [13706.658870] [<ffffffff81206496>] SyS_writev+0x56/0xf0 [13706.658872] [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160 [13706.658874] [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71 Fix this by making should_dirty a fuse_io_priv parameter that can be checked in fuse_aio_complete_req(). Reported-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-08-03 17:55:58 +02:00
Linus Torvalds	19ec50a438	Two more NFS client bugfixes for 4.13 Stable fix: - Fix EXCHANGE_ID corrupt verifier issue Other fix: - Fix double frees in nfs4_test_session_trunk() -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmCNJAACgkQ18tUv7Cl QOvlKhAAjjKMtdwYDV7B3SJvDfa5KNH9nNaMjKoD/OzWavDL7jRVRUWayeJiy4sf wJAMsG280ztag1Mr/OnClESZThWNrHTnKjq/P8kuefJz+pPvWP+AaQ/8QSM/rulM Y7qfh+FZb2t7lK80VAZTXEi4bvQzlkIV2285a34VIUm3VTpwl3HvltkBMLFztDSb sdaQh+N48FOYdG9RMncxpzFfRvUILzP9GFMg6WAFAcbr1wSsT9MfLqd+ANgOidE8 X2Ch9l0jN51m1dFGmMUZ5HMmuhUh2m50SXhmUOTpS7fj+k11OWsV2IIWHKPk/LjI 5E0aUPDU2TOE9noLSfkguAyxvqAGeAHPDIE7QM9vTGgktzXWGzh++ODeSzkIDp+5 iZ+cYLkme1lOg1nEqj1/SrIZXHP7ZCvK8iqNgoqT8rIp6zE1tLPnNAbRDmnwC/VH PrfINATV8llN/3vFojWJfD1enDe3+dALPINLVQOjwjafq6d5/hzRdCGqWpCDjmlU esDoCkoWGUjadIr6EZGtDAzSZgafsUx6d6QnGtkIUsz1d0FIXH6holB5EKaQpOLf dVb9iO5R2EcVP+WvB3KjA5y6MCNvoVqebxMvLBPCYsu3fI8fQ0sgSjgSJXi0fRzg G7DUsKcBGVKarDMXTUReL2G6lN7h0f5EsM1WnA9KF0TV9aah/ZE= =Ddq9 -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "Two fixes from Trond this time, now that he's back from his vacation. The first is a stable fix for the EXCHANGE_ID issue on the mailing list, and the other fixes a double-free situation that he found at the same time. Stable fix: - Fix EXCHANGE_ID corrupt verifier issue Other fix: - Fix double frees in nfs4_test_session_trunk()" * tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4: Fix double frees in nfs4_test_session_trunk() NFSv4: Fix EXCHANGE_ID corrupt verifier issue	2017-08-02 20:56:44 -07:00
Jan Kara	19ec8e4858	ocfs2: don't clear SGID when inheriting ACLs When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl() into ocfs2_iop_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Also posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating mode itself. Fixes: `073931017b` ("posix_acl: Clear SGID bit when setting file permissions") Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-02 17:16:13 -07:00
Mike Rapoport	5a18b64e3f	userfaultfd: non-cooperative: flush event_wqh at release time There may still be threads waiting on event_wqh at the time the userfault file descriptor is closed. Flush the events wait-queue to prevent waiting threads from hanging. Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: `9cd75c3cd4` ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-02 17:16:13 -07:00
Mike Rapoport	9d95aa4bad	userfaultfd_zeropage: return -ENOSPC in case mm has gone In the non-cooperative userfaultfd case, the process exit may race with outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC instead of -EINVAL when mm is already gone will allow uffd monitor to distinguish this case from other error conditions. Unfortunately I overlooked userfaultfd_zeropage when updating userfaultd_copy(). Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: `96333187ab` ("userfaultfd_copy: return -ENOSPC in case mm has gone") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-02 17:16:12 -07:00
Trond Myklebust	d9cb73300a	NFSv4: Fix double frees in nfs4_test_session_trunk() rpc_clnt_add_xprt() expects the callback function to be synchronous, and expects to release the transport and switch references itself. Fixes: `04fa2c6bb5` ("NFS pnfs data server multipath session trunking") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-08-02 09:45:55 -04:00
J. Bruce Fields	0020939f20	nfsd4: move some nfsd4 op definitions to xdr4.h I want code in nfs4xdr.c to have access to this stuff. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-08-01 17:36:35 -04:00
Trond Myklebust	fd40559c86	NFSv4: Fix EXCHANGE_ID corrupt verifier issue The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was changed to be asynchronous by commit `8d89bd70bc`. If we interrrupt the call to rpc_wait_for_completion_task(), we can therefore end up transmitting random stack contents in lieu of the verifier. Fixes: `8d89bd70bc` ("NFS setup async exchange_id") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-08-01 16:28:55 -04:00
Kees Cook	fe8993b3a0	exec: Consolidate pdeath_signal clearing Instead of an additional secureexec check for pdeath_signal, just move it up into the initial secureexec test. Neither perf nor arch code touches pdeath_signal, so the relocation shouldn't change anything. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:03:14 -07:00
Kees Cook	64701dee41	exec: Use sane stack rlimit under secureexec For a secureexec, before memory layout selection has happened, reset the stack rlimit to something sane to avoid the caller having control over the resulting layouts. $ ulimit -s 8192 $ ulimit -s unlimited $ /bin/sh -c 'ulimit -s' unlimited $ sudo /bin/sh -c 'ulimit -s' 8192 Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:03:14 -07:00
Kees Cook	473d89639d	exec: Consolidate dumpability logic Since it's already valid to set dumpability in the early part of setup_new_exec(), we can consolidate the logic into a single place. The BINPRM_FLAGS_ENFORCE_NONDUMP is set during would_dump() calls before setup_new_exec(), so its test is safe to move as well. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <james.l.morris@oracle.com>	2017-08-01 12:03:13 -07:00
Kees Cook	a70423dfbc	exec: Use secureexec for clearing pdeath_signal Like dumpability, clearing pdeath_signal happens both in setup_new_exec() and later in commit_creds(). The test in setup_new_exec() is different from all other privilege comparisons, though: it is checking the new cred (bprm) uid vs the old cred (current) euid. This appears to be a bug, introduced by commit `a6f76f23d2` ("CRED: Make execve() take advantage of copy-on-write credentials"): - if (bprm->e_uid != current_euid() \|\| - bprm->e_gid != current_egid()) { - set_dumpable(current->mm, suid_dumpable); + if (bprm->cred->uid != current_euid() \|\| + bprm->cred->gid != current_egid()) { It was bprm euid vs current euid (and egids), but the effective got dropped. Nothing in the exec flow changes bprm->cred->uid (nor gid). The call traces are: prepare_bprm_creds() prepare_exec_creds() prepare_creds() memcpy(new_creds, old_creds, ...) security_prepare_creds() (unimplemented by commoncap) ... prepare_binprm() bprm_fill_uid() resets euid/egid to current euid/egid sets euid/egid on bprm based on set*id file bits security_bprm_set_creds() cap_bprm_set_creds() handle all caps-based manipulations so this test is effectively a test of current_uid() vs current_euid(), which is wrong, just like the prior dumpability tests were wrong. The commit log says "Clear pdeath_signal and set dumpable on certain circumstances that may not be covered by commit_creds()." This may be meaning the earlier old euid vs new euid (and egid) test that got changed. Luckily, as with dumpability, this is all masked by commit_creds() which performs old/new euid and egid tests and clears pdeath_signal. And again, like dumpability, we should include LSM secureexec logic for pdeath_signal clearing. For example, Smack goes out of its way to clear pdeath_signal when it finds a secureexec condition. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <james.l.morris@oracle.com>	2017-08-01 12:03:11 -07:00
Kees Cook	e37fdb785a	exec: Use secureexec for setting dumpability The examination of "current" to decide dumpability is wrong. This was a check of and euid/uid (or egid/gid) mismatch in the existing process, not the newly created one. This appears to stretch back into even the "history.git" tree. Luckily, dumpability is later set in commit_creds(). In earlier kernel versions before creds existed, similar checks also existed late in the exec flow, covering up the mistake as far back as I could find. Note that because the commit_creds() check examines differences of euid, uid, egid, gid, and capabilities between the old and new creds, it would look like the setup_new_exec() dumpability test could be entirely removed. However, the secureexec test may cover a different set of tests (specific to the LSMs) than what commit_creds() checks for. So, fix this test to use secureexec (the removed euid tests are redundant to the commoncap secureexec checks now). Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <james.l.morris@oracle.com>	2017-08-01 12:03:10 -07:00
Kees Cook	2af6228026	LSM: drop bprm_secureexec hook This removes the bprm_secureexec hook since the logic has been folded into the bprm_set_creds hook for all LSMs now. Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: John Johansen <john.johansen@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:03:10 -07:00
Kees Cook	46d98eb4e1	commoncap: Refactor to remove bprm_secureexec hook The commoncap implementation of the bprm_secureexec hook is the only LSM that depends on the final call to its bprm_set_creds hook (since it may be called for multiple files, it ignores bprm->called_set_creds). As a result, it cannot safely _clear_ bprm->secureexec since other LSMs may have set it. Instead, remove the bprm_secureexec hook by introducing a new flag to bprm specific to commoncap: cap_elevated. This is similar to cap_effective, but that is used for a specific subset of elevated privileges, and exists solely to track state from bprm_set_creds to bprm_secureexec. As such, it will be removed in the next patch. Here, set the new bprm->cap_elevated flag when setuid/setgid has happened from bprm_fill_uid() or fscapabilities have been prepared. This temporarily moves the bprm_secureexec hook to a static inline. The helper will be removed in the next patch; this makes the step easier to review and bisect, since this does not introduce any changes to inputs nor outputs to the "elevated privileges" calculation. The new flag is merged with the bprm->secureexec flag in setup_new_exec() since this marks the end of any further prepare_binprm() calls. Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:03:08 -07:00
Kees Cook	c425e189ff	binfmt: Introduce secureexec flag The bprm_secureexec hook can be moved earlier. Right now, it is called during create_elf_tables(), via load_binary(), via search_binary_handler(), via exec_binprm(). Nearly all (see exception below) state used by bprm_secureexec is created during the bprm_set_creds hook, called from prepare_binprm(). For all LSMs (except commoncaps described next), only the first execution of bprm_set_creds takes any effect (they all check bprm->called_set_creds which prepare_binprm() sets after the first call to the bprm_set_creds hook). However, all these LSMs also only do anything with bprm_secureexec when they detected a secure state during their first run of bprm_set_creds. Therefore, it is functionally identical to move the detection into bprm_set_creds, since the results from secureexec here only need to be based on the first call to the LSM's bprm_set_creds hook. The single exception is that the commoncaps secureexec hook also examines euid/uid and egid/gid differences which are controlled by bprm_fill_uid(), via prepare_binprm(), which can be called multiple times (e.g. binfmt_script, binfmt_misc), and may clear the euid/egid for the final load (i.e. the script interpreter). However, while commoncaps specifically ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time prepare_binprm() may get called, it needs to base the secureexec decision on the final call to bprm_set_creds. As a result, it will need special handling. To begin this refactoring, this adds the secureexec flag to the bprm struct, and calls the secureexec hook during setup_new_exec(). This is safe since all the cred work is finished (and past the point of no return). This explicit call will be removed in later patches once the hook has been removed. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <james.l.morris@oracle.com>	2017-08-01 12:03:05 -07:00
Kees Cook	a9208e42ba	exec: Correct comments about "point of no return" In commit `221af7f87b` ("Split 'flush_old_exec' into two functions"), the comment about the point of no return should have stayed in flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior changes in commits `c89681ed7d` ("remove steal_locks()"), and `fd8328be87` ("sanitize handling of shared descriptor tables in failing execve()") made it look like it meant the current->sas_ss_sp line instead. The comment was referring to the fact that once bprm->mm is NULL, all failures from a binfmt load_binary hook (e.g. load_elf_binary), will get SEGV raised against current. Move this comment and expand the explanation a bit, putting it above the assignment this time, and add details about the true nature of "point of no return" being the call to flush_old_exec() itself. This also removes an erroneous commet about when credentials are being installed. That has its own dedicated function, install_exec_creds(), which carries a similar (and correct) comment, so remove the bogus comment where installation is not actually happening. Cc: David Howells <dhowells@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:03:04 -07:00
Kees Cook	ddb4a1442d	exec: Rename bprm->cred_prepared to called_set_creds The cred_prepared bprm flag has a misleading name. It has nothing to do with the bprm_prepare_cred hook, and actually tracks if bprm_set_creds has been called. Rename this flag and improve its comment. Cc: David Howells <dhowells@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Serge Hallyn <serge@hallyn.com>	2017-08-01 12:02:48 -07:00
David S. Miller	29fda25a2d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Two minor conflicts in virtio_net driver (bug fix overlapping addition of a helper) and MAINTAINERS (new driver edit overlapping revamp of PHY entry). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-01 10:07:50 -07:00
Jeff Layton	3b49c9a1e9	fs: convert a pile of fsync routines to errseq_t based reporting This patch converts most of the in-kernel filesystems that do writeback out of the pagecache to report errors using the errseq_t-based infrastructure that was recently added. This allows them to report errors once for each open file description. Most filesystems have a fairly straightforward fsync operation. They call filemap_write_and_wait_range to write back all of the data and wait on it, and then (sometimes) sync out the metadata. For those filesystems this is a straightforward conversion from calling filemap_write_and_wait_range in their fsync operation to calling file_write_and_wait_range. Acked-by: Jan Kara <jack@suse.cz> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-08-01 08:39:29 -04:00
Jeff Layton	d07a6ac7b6	gfs2: convert to errseq_t based writeback error reporting for fsync Also, fix a place where a writeback error might get dropped in the gfs2_is_jdata case. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-08-01 08:39:29 -04:00
Jeff Layton	6454568d96	fs: convert sync_file_range to use errseq_t based error-tracking sync_file_range doesn't call down into the filesystem directly at all. It only kicks off writeback of pagecache pages and optionally waits on the result. Convert sync_file_range to use errseq_t based error tracking, under the assumption that most users will prefer this behavior when errors occur. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-08-01 08:39:29 -04:00
Chao Yu	ddc34e328d	f2fs: introduce f2fs_statfs_project This patch introduces f2fs_statfs_project, it enables to show usage status of directory tree which is limited with project quota. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:36 -07:00
Chao Yu	2c1d030569	f2fs: support F2FS_IOC_FS{GET,SET}XATTR This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support for f2fs. The interface is kept consistent with the one of ext4/xfs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:35 -07:00
Jaegeuk Kim	b6a245eb34	f2fs: don't need to wait for node writes for atomic write We have a node chain to serialize node block writes, so if any IOs for node block writes are reordered, we'll get broken node chain. IOWs, roll-forward recovery will see all or none node blocks given fsync mark. E.g., Node chain consists of: N1 -> N2 -> N3 -> NFSYNC -> N1' -> N2' -> N'FSYNC Reordered to: 1) N1 -> N2 -> N3 -> N2' -> NFSYNC -> N'FSYNC -> power-cut 2) N1 -> N2 -> N3 -> N1' -> NFSYNC -> power-cut 3) N1 -> N2 -> NFSYNC -> N1' -> N'FSYNC -> N3 -> power-cut 4) N1 -> NFSYNC -> N1' -> N2' -> N'FSYNC -> N3 -> power-cut Roll-forward recovery can proceed to: 1) N1 -> N2 -> N3 -> NFSYNC -> X 2) N1 -> N2 -> N3 -> NFSYNC -> N1' -> X 3) N1 -> N2 -> N3 -> FSYNC -> N1' -> X 4) N1 -> X Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:34 -07:00
Jaegeuk Kim	dc6b205510	f2fs: avoid naming confusion of sysfs init This patch changes the function names of sysfs init to follow ext4. f2fs_init_sysfs <-> f2fs_register_sysfs f2fs_exit_sysfs <-> f2fs_unregister_sysfs Suggested-by: Chao Yu <yuchao0@huawei.com> Reivewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:33 -07:00
Chao Yu	5c57132eaf	f2fs: support project quota This patch adds to support plain project quota. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:32 -07:00
Chao Yu	a6d3a479ae	f2fs: record quota during dot{,dot} recovery In ->lookup(), we will have a try to recover dot or dotdot for corrupted directory, once disk quota is on, if it allocates new block during dotdot recovery, we need to record disk quota info for the allocation, so this patch fixes this issue by adding missing dquot_initialize() in __recover_dot_dentries. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:31 -07:00
Chao Yu	7a2af766af	f2fs: enhance on-disk inode structure scalability This patch add new flag F2FS_EXTRA_ATTR storing in inode.i_inline to indicate that on-disk structure of current inode is extended. In order to extend, we changed the inode structure a bit: Original one: struct f2fs_inode { ... struct f2fs_extent i_ext; __le32 i_addr[DEF_ADDRS_PER_INODE]; __le32 i_nid[DEF_NIDS_PER_INODE]; } Extended one: struct f2fs_inode { ... struct f2fs_extent i_ext; union { struct { __le16 i_extra_isize; __le16 i_padding; __le32 i_extra_end[0]; }; __le32 i_addr[DEF_ADDRS_PER_INODE]; }; __le32 i_nid[DEF_NIDS_PER_INODE]; } Once F2FS_EXTRA_ATTR is set, we will steal four bytes in the head of i_addr field for storing i_extra_isize and i_padding. with i_extra_isize, we can calculate actual size of reserved space in i_addr, available attribute fields included in total extra attribute fields for current inode can be described as below: +--------------------+ \| .i_mode \| \| ... \| \| .i_ext \| +--------------------+ \| .i_extra_isize \|-----+ \| .i_padding \| \| \| .i_prjid \| \| \| .i_atime_extra \| \| \| .i_ctime_extra \| \| \| .i_mtime_extra \|<----+ \| .i_inode_cs \|<----- store blkaddr/inline from here \| .i_xattr_cs \| \| ... \| +--------------------+ \| \| \| block address \| \| \| +--------------------+ \| .i_nid \| +--------------------+ \| node_footer \| \| (nid, ino, offset) \| +--------------------+ Hence, with this patch, we would enhance scalability of f2fs inode for storing more newly added attribute. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:30 -07:00
Chao Yu	f247037120	f2fs: make max inline size changeable This patch tries to make below macros calculating max inline size, inline dentry field size considerring reserving size-changeable space: - MAX_INLINE_DATA - NR_INLINE_DENTRY - INLINE_DENTRY_BITMAP_SIZE - INLINE_RESERVED_SIZE Then, when inline_{data,dentry} options is enabled, it allows us to reserve inline space with different size flexibly for adding newly introduced inode attribute. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:29 -07:00
Jaegeuk Kim	e65ef20781	f2fs: add ioctl to expose current features This patch adds an ioctl to provide feature information to user. For exapmle, SQLite can use this ioctl to detect whether f2fs support atomic write or not. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:48:28 -07:00
Chao Yu	dc6febb6bc	f2fs: make background threads of f2fs being aware of freezing When ->freeze_fs is called from lvm for doing snapshot, it needs to make sure there will be no more changes in filesystem's data, however, previously, background threads like GC thread wasn't aware of freezing, so in environment with active background threads, data of snapshot becomes unstable. This patch fixes this issue by adding sb_{start,end}_intwrite in below background threads: - GC thread - flush thread - discard thread Note that, don't use sb_start_intwrite() in gc_thread_func() due to: generic/241 reports below bug: ====================================================== WARNING: possible circular locking dependency detected 4.13.0-rc1+ #32 Tainted: G O ------------------------------------------------------ f2fs_gc-250:0/22186 is trying to acquire lock: (&sbi->gc_mutex){+.+...}, at: [<f8fa7f0b>] f2fs_sync_fs+0x7b/0x1b0 [f2fs] but task is already holding lock: (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (sb_internal#2){++++.-}: __lock_acquire+0x405/0x7b0 lock_acquire+0xae/0x220 __sb_start_write+0x11d/0x1f0 f2fs_evict_inode+0x2d6/0x4e0 [f2fs] evict+0xa8/0x170 iput+0x1fb/0x2c0 f2fs_sync_inode_meta+0x3f/0xf0 [f2fs] write_checkpoint+0x1b1/0x750 [f2fs] f2fs_sync_fs+0x85/0x1b0 [f2fs] f2fs_do_sync_file.isra.24+0x137/0xa30 [f2fs] f2fs_sync_file+0x34/0x40 [f2fs] vfs_fsync_range+0x4a/0xa0 do_fsync+0x3c/0x60 SyS_fdatasync+0x15/0x20 do_fast_syscall_32+0xa1/0x1b0 entry_SYSENTER_32+0x4c/0x7b -> #1 (&sbi->cp_mutex){+.+...}: __lock_acquire+0x405/0x7b0 lock_acquire+0xae/0x220 __mutex_lock+0x4f/0x830 mutex_lock_nested+0x25/0x30 write_checkpoint+0x2f/0x750 [f2fs] f2fs_sync_fs+0x85/0x1b0 [f2fs] sync_filesystem+0x67/0x80 generic_shutdown_super+0x27/0x100 kill_block_super+0x22/0x50 kill_f2fs_super+0x3a/0x40 [f2fs] deactivate_locked_super+0x3d/0x70 deactivate_super+0x40/0x60 cleanup_mnt+0x39/0x70 __cleanup_mnt+0x10/0x20 task_work_run+0x69/0x80 exit_to_usermode_loop+0x57/0x92 do_fast_syscall_32+0x18c/0x1b0 entry_SYSENTER_32+0x4c/0x7b -> #0 (&sbi->gc_mutex){+.+...}: validate_chain.isra.36+0xc50/0xdb0 __lock_acquire+0x405/0x7b0 lock_acquire+0xae/0x220 __mutex_lock+0x4f/0x830 mutex_lock_nested+0x25/0x30 f2fs_sync_fs+0x7b/0x1b0 [f2fs] f2fs_balance_fs_bg+0xb9/0x200 [f2fs] gc_thread_func+0x302/0x4a0 [f2fs] kthread+0xe9/0x120 ret_from_fork+0x19/0x24 other info that might help us debug this: Chain exists of: &sbi->gc_mutex --> &sbi->cp_mutex --> sb_internal#2 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal#2); lock(&sbi->cp_mutex); lock(sb_internal#2); lock(&sbi->gc_mutex); * DEADLOCK * 1 lock held by f2fs_gc-250:0/22186: #0: (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs] stack backtrace: CPU: 2 PID: 22186 Comm: f2fs_gc-250:0 Tainted: G O 4.13.0-rc1+ #32 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: dump_stack+0x5f/0x92 print_circular_bug+0x1b3/0x1bd validate_chain.isra.36+0xc50/0xdb0 ? __this_cpu_preempt_check+0xf/0x20 __lock_acquire+0x405/0x7b0 lock_acquire+0xae/0x220 ? f2fs_sync_fs+0x7b/0x1b0 [f2fs] __mutex_lock+0x4f/0x830 ? f2fs_sync_fs+0x7b/0x1b0 [f2fs] mutex_lock_nested+0x25/0x30 ? f2fs_sync_fs+0x7b/0x1b0 [f2fs] f2fs_sync_fs+0x7b/0x1b0 [f2fs] f2fs_balance_fs_bg+0xb9/0x200 [f2fs] gc_thread_func+0x302/0x4a0 [f2fs] ? preempt_schedule_common+0x2f/0x4d ? f2fs_gc+0x540/0x540 [f2fs] kthread+0xe9/0x120 ? f2fs_gc+0x540/0x540 [f2fs] ? kthread_create_on_node+0x30/0x30 ret_from_fork+0x19/0x24 The deadlock occurs in below condition: GC Thread Thread B - sb_start_intwrite - f2fs_sync_file - f2fs_sync_fs - mutex_lock(&sbi->gc_mutex) - write_checkpoint - block_operations - f2fs_sync_inode_meta - iput - sb_start_intwrite - mutex_lock(&sbi->gc_mutex) Fix this by altering sb_start_intwrite to sb_start_write_trylock. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-31 16:47:27 -07:00
Jeff Layton	7e51fe1dd1	fuse: convert to errseq_t based error tracking for fsync Change to file_write_and_wait_range and file_check_and_advance_wb_err Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-07-31 19:12:25 -04:00
Jeff Layton	9c5d58fb9e	ext4: convert swap_inode_data() over to use swap() on most of the fields For some odd reason, it forces a byte-by-byte copy of each field. A plain old swap() on most of these fields would be more efficient. We do need to retain the memswap of i_data however as that field is an array. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2017-07-31 00:55:34 -04:00
Emoly Liu	191eac3300	ext4: error should be cleared if ea_inode isn't added to the cache For Lustre, if ea_inode fails in hash validation but passes parent inode and generation checks, it won't be added to the cache as well as the error "-EFSCORRUPTED" should be cleared, otherwise it will cause "Structure needs cleaning" when running getfattr command. Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-9723 Cc: stable@vger.kernel.org Fixes: `dec214d00e` Signed-off-by: Emoly Liu <emoly.liu@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: tahsin@google.com	2017-07-31 00:40:22 -04:00
Jan Kara	a3bb2d5587	ext4: Don't clear SGID when inheriting ACLs When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __ext4_set_acl() into ext4_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: `073931017b` CC: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2017-07-30 23:33:01 -04:00
Ernesto A. Fernández	397e434176	ext4: preserve i_mode if __ext4_set_acl() fails When changing a file's acl mask, __ext4_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by only changing the inode mode after the acl has been set. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2017-07-30 22:43:41 -04:00
Eric Whitney	a627b0a7c1	ext4: remove unused metadata accounting variables Two variables in ext4_inode_info, i_reserved_meta_blocks and i_allocated_meta_blocks, are unused. Removing them saves a little memory per in-memory inode and cleans up clutter in several tracepoints. Adjust tracepoint output from ext4_alloc_da_blocks() for consistency and fix a typo and whitespace near these changes. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2017-07-30 22:30:11 -04:00
Eric Whitney	1e21196c8e	ext4: correct comment references to ext4_ext_direct_IO() Commit `914f82a32d` "ext4: refactor direct IO code" deleted ext4_ext_direct_IO(), but references to that function remain in comments. Update them to refer to ext4_direct_IO_write(). Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz>	2017-07-30 22:26:40 -04:00
Shaohua Li	69fd5c3917	blktrace: add an option to allow displaying cgroup path By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	aa81882534	kernfs: add exportfs operations Now we have the facilities to implement exportfs operations. The idea is cgroup can export the fhandle info to userspace, then userspace uses fhandle to find the cgroup name. Another example is userspace can get fhandle for a cgroup and BPF uses the fhandle to filter info for the cgroup. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	c53cd490b1	kernfs: introduce kernfs_node_id inode number and generation can identify a kernfs node. We are going to export the identification by exportfs operations, so put ino and generation into a separate structure. It's convenient when later patches use the identification. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	319ba91d35	kernfs: don't set dentry->d_fsdata When working on adding exportfs operations in kernfs, I found it's hard to initialize dentry->d_fsdata in the exportfs operations. Looks there is no way to do it without race condition. Look at the kernfs code closely, there is no point to set dentry->d_fsdata. inode->i_private already points to kernfs_node, and we can get inode from a dentry. So this patch just delete the d_fsdata usage. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	ba16b2846a	kernfs: add an API to get kernfs node from inode number Add an API to get kernfs node from inode number. We will need this to implement exportfs operations. This API will be used in blktrace too later, so it should be as fast as possible. To make the API lock free, kernfs node is freed in RCU context. And we depend on kernfs_node count/ino number to filter out stale kernfs nodes. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	4a3ef68aca	kernfs: implement i_generation Set i_generation for kernfs inode. This is required to implement exportfs operations. The generation is 32-bit, so it's possible the generation wraps up and we find stale files. To reduce the posssibility, we don't reuse inode numer immediately. When the inode number allocation wraps, we increase generation number. In this way generation/inode number consist of a 64-bit number which is unlikely duplicated. This does make the idr tree more sparse and waste some memory. Since idr manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6 bits of the key. In a 100k inode kernfs, the worst case will have around 300k radix tree node. Each node is 576bytes, so the tree will use about ~150M memory. Sounds not too bad, if this really is a problem, we should find better data structure. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Shaohua Li	7d35079f82	kernfs: use idr instead of ida to manage inode number kernfs uses ida to manage inode number. The problem is we can't get kernfs_node from inode number with ida. Switching to use idr, next patch will add an API to get kernfs_node from inode number. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-29 09:00:03 -06:00
Jaegeuk Kim	7a10f0177e	f2fs: don't give partially written atomic data from process crash This patch resolves the below scenario. == Process 1 == == Process 2 == open(w) open(rw) begin write(new_#1) process_crash f_op->flush locks_remove_posix f_op>release read (new_#1) In order to avoid corrupted database caused by new_#1, we must do roll-back at process_crash time. In order to check that, this patch keeps task which triggers transaction begin, and does roll-back in f_op->flush before removing file locks. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-28 17:49:01 -07:00
Jaegeuk Kim	640cc18982	f2fs: give a try to do atomic write in -ENOMEM case It'd be better to retry writing atomic pages when we get -ENOMEM. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-28 17:49:00 -07:00
Ernesto A. Fernández	14af20fcb1	f2fs: preserve i_mode if __f2fs_set_acl() fails When changing a file's acl mask, __f2fs_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by only changing the inode mode after the acl has been set. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-28 17:48:54 -07:00
Linus Torvalds	286ba844c5	More NFS client bugfixes for 4.13 Stable fixes: - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter - Invalidate file size when taking a lock to prevent corruption Other fixes: - Don't excessively generate tiny writes with fallocate - Use the raw NFS access mask in nfs4_opendata_access() -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAll7l0oACgkQ18tUv7Cl QOuL3w/+M5I5xKKrMOjg2cfzdFAn+syTmXYK0HFrxp76CiaNBcQFK5kG1ebkdrTM EDXWznamWRTCIbg0U7/3X/763yWUyZM8RSC0nJXyt7FNZitg1Hsvw/OawaM2Z8q4 TQQPqelEAhAG7zbgyCCe+SuAoAOTq7HpX2wru8gK6POBOP6gmNEJtchBzqrlsq8d bSH8E7BLhbRIwC3htsPfTW0NZtqpp7u/wKeLtt/ZGIuM/+78iOa1wMwCESVJfd47 2DpDmS0LxVtAjs8lCjMBKyrypNEvh+evgdbxeiXG/T6ykzWhBy96OSOm7ooIjqOr pkptxrKOBGb9/8rMnxjCKRIfjgVz77GfB2jD+/ILzRP1E0xipHXWCc5XqzBP999l zqUVDMPH3zrq8lxmC9FgoY1PJAcRrZ/aEIjozwkVcksTDYx+GJPYMR6Wks9/1cT0 4jLTRsBgckj9b3FcjsCiyavHBweDChCEgzx5CLpEqH1KKCfT6MnLTb/WoJL/s1R8 MLb0MC5PMpLP4OvCRR+mCg+dJD2nXF/Cz2E9r3SSlhZiDsNWmdBXk2A5XoAFzY5l pQeqkogBdiINu7p/G3n7837ThRUGV+04C9D9WDI7IF/dktOyYYO/4DNDVYEiFqKL 9v8Hc4EyGwR2dY5iEKaSuNEk8zTxL1ZGv1H8WTCSDmNRQ+64Q3s= =OHnE -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "More NFS client bugfixes for 4.13. Most of these fix locking bugs that Ben and Neil noticed, but I also have a patch to fix one more access bug that was reported after last week. Stable fixes: - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter - Invalidate file size when taking a lock to prevent corruption Other fixes: - Don't excessively generate tiny writes with fallocate - Use the raw NFS access mask in nfs4_opendata_access()" * tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter NFS: Optimize fallocate by refreshing mapping when needed. NFS: invalidate file size when taking a lock. NFS: Use raw NFS access mask in nfs4_opendata_access()	2017-07-28 14:44:56 -07:00
Linus Torvalds	19993e7378	Changes since last update: - Fix firstfsb variables that we left uninitialized, which could lead to locking problems. - Check for NULL metadata buffer pointers before using them. - Don't allow btree cursor manipulation if the btree block is corrupt. Better to just shut down. - Fix infinite loop problems in quotacheck. - Fix buffer overrun when validating directory blocks. - Fix deadlock problem in bunmapi. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCgAGBQJZeXhTAAoJEPh/dxk0SrTrUK8P/RvwLgEflTEUQjjNoakhOuWV yYlXJz2ArWbG/w8vtW4rb6gnExJ6OmJ4EUZWe78g0oTpo9vmS7GuHE0HBXS8AiNe 9Y87GBIQLRi6BOlY9wfiKcLyA3u/buLSAkFhjulA+ARRIS2G3pW/PkzOfl1tIJhl rpL/xJ8TAcNz5LLu/znwebtnIMbaplMdV80b4dOHoNvYC0mYaOFTRiyXANqdCnKx C4tYyKkkQHYDjyXjOwJt8I8CUvcbMrOVQd1E1px+n2L9O81dUP04PhF8N0vPZl/Q ueP83KRqCAm89HMc2P/P0bkBZmbFUtgtMA67oOUxx66crDWEExGRhqZ1+/UgJsAg t5yFg3+QwgaXhAXcZZrvGGMT0b3L6ew5//dhY8XcMq2xKpKrxls/RtQrw3Lux+qx lHhGIAyd15LBKHWARwGXC315gOMfLnUuWhG63pOygL4PrVvOY22Axj5YdRJD5J6E Z4oRzqhQngeLrbfbj73DQFcGxdeEUodB+Pz8uTQ+6pfy5JU3dMzdI16ekX1bgZV3 qFFMRR77a4RpAWYy27LYeaa8NTAEQEahKdRWXofjgKjfsvgnxe+cqhoSdkExAX0c MM0DtXMo2dMjpsajNCo971jPK89a07dY+6Y9COwSMV2vD8Ml43v2F9nS9kcQlUAP H6kdr19vd5p/BBH1P+lu =TKiL -----END PGP SIGNATURE----- Merge tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: - fix firstfsb variables that we left uninitialized, which could lead to locking problems. - check for NULL metadata buffer pointers before using them. - don't allow btree cursor manipulation if the btree block is corrupt. Better to just shut down. - fix infinite loop problems in quotacheck. - fix buffer overrun when validating directory blocks. - fix deadlock problem in bunmapi. * tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix multi-AG deadlock in xfs_bunmapi xfs: check that dir block entries don't off the end of the buffer xfs: fix quotacheck dquot id overflow infinite loop xfs: check _alloc_read_agf buffer pointer before using xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write xfs: check _btree_check_block value	2017-07-28 14:29:48 -07:00
Benjamin Coddington	b7dbcc0e43	NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the same region protected by the wait_queue's lock after checking for a notification from CB_NOTIFY_LOCK callback. However, after releasing that lock, a wakeup for that task may race in before the call to freezable_schedule_timeout_interruptible() and set TASK_WAKING, then freezable_schedule_timeout_interruptible() will set the state back to TASK_INTERRUPTIBLE before the task will sleep. The result is that the task will sleep for the entire duration of the timeout. Since we've already set TASK_INTERRUPTIBLE in the locked section, just use freezable_schedule_timout() instead. Fixes: `a1d617d8f1` ("nfs: allow blocking locks to be awoken by lock callbacks") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-28 15:35:30 -04:00
Linus Torvalds	0a2a1330d2	Merge branch 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes addressing problems reported by users, and there's one more regression fix" * 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: round down size diff when shrinking/growing device Btrfs: fix early ENOSPC due to delalloc btrfs: fix lockup in find_free_extent with read-only block groups Btrfs: fix dir item validation when replaying xattr deletes	2017-07-28 12:26:59 -07:00
NeilBrown	6ba80d4348	NFS: Optimize fallocate by refreshing mapping when needed. posix_fallocate() will allocate space in an NFS file by considering the last byte of every 4K block. If it is before EOF, it will read the byte and if it is zero, a zero is written out. If it is after EOF, the zero is unconditionally written. For the blocks beyond EOF, if NFS believes its cache is valid, it will expand these writes to write full pages, and then will merge the pages. This results if (typically) 1MB writes. If NFS believes its cache is not valid (particularly if NFS_INO_INVALID_DATA or NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will send the individual 1-byte writes. This results in (typically) 256 times as many RPC requests, and can be substantially slower. Currently nfs_revalidate_mapping() is only used when reading a file or mmapping a file, as these are times when the content needs to be up-to-date. Writes don't generally need the cache to be up-to-date, but writes beyond EOF can benefit, particularly in the posix_fallocate() case. So this patch calls nfs_revalidate_mapping() when writing beyond EOF - i.e. when there is a gap between the end of the file and the start of the write. If the cache is thought to be out of date (as happens after taking a file lock), this will cause a GETATTR, and the two flags mentioned above will be cleared. With this, posix_fallocate() on a newly locked file does not generate excessive tiny writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-27 11:22:42 -04:00
NeilBrown	442ce0499c	NFS: invalidate file size when taking a lock. Prior to commit `ca0daa277a` ("NFS: Cache aggressively when file is open for writing"), NFS would revalidate, or invalidate, the file size when taking a lock. Since that commit it only invalidates the file content. If the file size is changed on the server while wait for the lock, the client will have an incorrect understanding of the file size and could corrupt data. This particularly happens when writing beyond the (supposed) end of file and can be easily be demonstrated with posix_fallocate(). If an application opens an empty file, waits for a write lock, and then calls posix_fallocate(), glibc will determine that the underlying filesystem doesn't support fallocate (assuming version 4.1 or earlier) and will write out a '0' byte at the end of each 4K page in the region being fallocated that is after the end of the file. NFS will (usually) detect that these writes are beyond EOF and will expand them to cover the whole page, and then will merge the pages. Consequently, NFS will write out large blocks of zeroes beyond where it thought EOF was. If EOF had moved, the pre-existing part of the file will be over-written. Locking should have protected against this, but it doesn't. This patch restores the use of nfs_zap_caches() which invalidated the cached attributes. When posix_fallocate() asks for the file size, the request will go to the server and get a correct answer. cc: stable@vger.kernel.org (v4.8+) Fixes: `ca0daa277a` ("NFS: Cache aggressively when file is open for writing") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-27 11:22:42 -04:00
Yunlei He	8790568255	f2fs: alloc new nids for xattr block in recovery recovery file A: recovery file B: -get_dnode_of_data -alloc_nid -recover_xattr_data -set_node_addr(sbi, &ni, NEW_ADDR, false); --->bug_on for nid has been used by file A In recovery process, new allocated node blocks may "reuse" xattr block nids, this patch alloc new nids for xattr blocks in recovery process to avoid this problem. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-26 19:34:30 -07:00
Chao Yu	76a9dd85d4	f2fs: spread struct f2fs_dentry_ptr for inline path Use f2fs_dentry_ptr structure to indicate inline dentry structure as much as possible, so we can wrap inline dentry with size-fixed fields to the one with size-changeable fields. With this change, we can handle size-changeable inline dentry more easily. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-26 19:34:30 -07:00
Yunlei He	5f4ce6abc2	f2fs: remove unused input parameter This patch remove unused input parameter in function new_node_page. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Yong Sheng <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-07-26 19:34:30 -07:00
Anna Schumaker	1e6f209515	NFS: Use raw NFS access mask in nfs4_opendata_access() Commit `bd8b244174` ("NFS: Store the raw NFS access mask in the inode's access cache") changed how the access results are stored after an access() call. An NFS v4 OPEN might have access bits returned with the opendata, so we should use the NFS4_ACCESS values when determining the return value in nfs4_opendata_access(). Fixes: `bd8b244174` ("NFS: Store the raw NFS access mask in the inode's access cache") Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Tested-by: Takashi Iwai <tiwai@suse.de>	2017-07-26 16:53:57 -04:00
Christoph Hellwig	5b094d6dac	xfs: fix multi-AG deadlock in xfs_bunmapi Just like in the allocator we must avoid touching multiple AGs out of order when freeing blocks, as freeing still locks the AGF and can cause the same AB-BA deadlocks as in the allocation path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-07-26 08:20:03 -07:00
Linus Torvalds	25f6a53799	JFS fixes for 4.13 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAll3YIwACgkQNqiEXrVA jGQE9w//b/MyL9MtvGcKI7u3V7RrqbodzoL42KxV98TI3y7rpmUcmRsqDT045ufD S+AOqnIhvH2XbsF7jvZ0UROUtErBgh+pIRqCctVqQM+GKE7p/KR0rY/4eMPlbqQL Q2L0ZbokHU4mgvo7SqSJkYFRd0PPBaaw4aJaf8gg1g0pCb29jcer4ycKKHCjz0Wz YS2/v7NVWysehihiz/JF6ga5/n6VZeXq3fa48Mmt4TDzKt4IaqgLtAEbPa+eddwW z/t4YIoiQ4JUYPnLNmG1Kd8lACfeqKkr2WIh3143ipof4Tj3ZaNc315wVQUl58LP 6ZUg12tLDVH9OOPdI9VsVLostRbyKH3yUazULBXWIQQt6ZWcFiKFbturMlZevSJB OuGBsfpDDvkd+2n2iNcwane/+Ouw4LChzUvmp9/MuIRxinRVdXAVRjaUb2M54xSW qR/Yfw8qyky9tac1c1Md/bVo7oMEw0Xliv3HxmGKHNPLMOLzwfVTAw0gGXSrGFn8 veUKCl1J1+9NWoNExDkyUjsmD1CdkhQk1gpmU70KWQlCgKCtBhWiX6rEy0l8pw1t G5UmFpgqG8g61ODuW0dbnSdImmS8mcbY4I1lSvQFb3qVtCeBLz9q7OKSCuavbs4M egtaUNVMJ22dVv12loj2OOyVdneUXbUB0WVga3kfEq7QRCtSjF4= =fVR9 -----END PGP SIGNATURE----- Merge tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggy Pull JFS fixes from David Kleikamp. * tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggy: jfs: preserve i_mode if __jfs_set_acl() fails jfs: Don't clear SGID when inheriting ACLs jfs: atomically read inode size	2017-07-25 08:51:57 -07:00
Darrick J. Wong	6215894e11	xfs: check that dir block entries don't off the end of the buffer When we're checking the entries in a directory buffer, make sure that the entry length doesn't push us off the end of the buffer. Found via xfs/388 writing ones to the length fields. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-07-25 08:36:35 -07:00
Eric W. Biederman	cc731525f2	signal: Remove kernel interal si_code magic struct siginfo is a union and the kernel since 2.4 has been hiding a union tag in the high 16bits of si_code using the values: __SI_KILL __SI_TIMER __SI_POLL __SI_FAULT __SI_CHLD __SI_RT __SI_MESGQ __SI_SYS While this looks plausible on the surface, in practice this situation has not worked well. - Injected positive signals are not copied to user space properly unless they have these magic high bits set. - Injected positive signals are not reported properly by signalfd unless they have these magic high bits set. - These kernel internal values leaked to userspace via ptrace_peek_siginfo - It was possible to inject these kernel internal values and cause the the kernel to misbehave. - Kernel developers got confused and expected these kernel internal values in userspace in kernel self tests. - Kernel developers got confused and set si_code to __SI_FAULT which is SI_USER in userspace which causes userspace to think an ordinary user sent the signal and that it was not kernel generated. - The values make it impossible to reorganize the code to transform siginfo_copy_to_user into a plain copy_to_user. As si_code must be massaged before being passed to userspace. So remove these kernel internal si codes and make the kernel code simpler and more maintainable. To replace these kernel internal magic si_codes introduce the helper function siginfo_layout, that takes a signal number and an si_code and computes which union member of siginfo is being used. Have siginfo_layout return an enumeration so that gcc will have enough information to warn if a switch statement does not handle all of union members. A couple of architectures have a messed up ABI that defines signal specific duplications of SI_USER which causes more special cases in siginfo_layout than I would like. The good news is only problem architectures pay the cost. Update all of the code that used the previous magic __SI_ values to use the new SIL_ values and to call siginfo_layout to get those values. Escept where not all of the cases are handled remove the defaults in the switch statements so that if a new case is missed in the future the lack will show up at compile time. Modify the code that copies siginfo si_code to userspace to just copy the value and not cast si_code to a short first. The high bits are no longer used to hold a magic union member. Fixup the siginfo header files to stop including the __SI_ values in their constants and for the headers that were missing it to properly update the number of si_codes for each signal type. The fixes to copy_siginfo_from_user32 implementations has the interesting property that several of them perviously should never have worked as the __SI_ values they depended up where kernel internal. With that dependency gone those implementations should work much better. The idea of not passing the __SI_ values out to userspace and then not reinserting them has been tested with criu and criu worked without changes. Ref: 2.4.0-test1 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2017-07-24 14:30:28 -05:00
Eric W. Biederman	d08477aa97	fcntl: Don't use ambiguous SIG_POLL si_codes We have a weird and problematic intersection of features that when they all come together result in ambiguous siginfo values, that we can not support properly. - Supporting fcntl(F_SETSIG,...) with arbitrary valid signals. - Using positive values for POLL_IN, POLL_OUT, POLL_MSG, ..., etc that imply they are signal specific si_codes and using the aforementioned arbitrary signal to deliver them. - Supporting injection of arbitrary siginfo values for debugging and checkpoint/restore. The result is that just looking at siginfo si_codes of 1 to 6 are ambigious. It could either be a signal specific si_code or it could be a generic si_code. For most of the kernel this is a non-issue but for sending signals with siginfo it is impossible to play back the kernel signals and get the same result. Strictly speaking when the si_code was changed from SI_SIGIO to POLL_IN and friends between 2.2 and 2.4 this functionality was not ambiguous, as only real time signals were supported. Before 2.4 was released the kernel began supporting siginfo with non realtime signals so they could give details of why the signal was sent. The result is that if F_SETSIG is set to one of the signals with signal specific si_codes then user space can not know why the signal was sent. I grepped through a bunch of userspace programs using debian code search to get a feel for how often people choose a signal that results in an ambiguous si_code. I only found one program doing so and it was using SIGCHLD to test the F_SETSIG functionality, and did not appear to be a real world usage. Therefore the ambiguity does not appears to be a real world problem in practice. Remove the ambiguity while introducing the smallest chance of breakage by changing the si_code to SI_SIGIO when signals with signal specific si_codes are targeted. Fixes: v2.3.40 -- Added support for queueing non-rt signals Fixes: v2.3.21 -- Changed the si_code from SI_SIGIO Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2017-07-24 14:29:23 -05:00
Brian Foster	cfaf2d0343	xfs: fix quotacheck dquot id overflow infinite loop If a dquot has an id of U32_MAX, the next lookup index increment overflows the uint32_t back to 0. This starts the lookup sequence over from the beginning, repeats indefinitely and results in a livelock. Update xfs_qm_dquot_walk() to explicitly check for the lookup overflow and exit the loop. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-07-24 08:33:25 -07:00
Nikolay Borisov	0e4324a4c3	btrfs: round down size diff when shrinking/growing device Further testing showed that the fix introduced in `7dfb8be11b` ("btrfs: Round down values which are written for total_bytes_size") was insufficient and it could still lead to discrepancies between the total_bytes in the super block and the device total bytes. So this patch also ensures that the difference between old/new sizes when shrinking/growing is also rounded down. This ensure that we won't be subtracting/adding a non-sectorsize multiples to the superblock/device total sizees. Fixes: `7dfb8be11b` ("btrfs: Round down values which are written for total_bytes_size") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-24 16:05:00 +02:00
Omar Sandoval	17024ad0a0	Btrfs: fix early ENOSPC due to delalloc If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.) Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs. Fixes: `957780eb27` ("Btrfs: introduce ticketed enospc infrastructure") Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-24 16:04:26 +02:00
Jeff Mahoney	144439376b	btrfs: fix lockup in find_free_extent with read-only block groups If we have a block group that is all of the following: 1) uncached in memory 2) is read-only 3) has a disk cache state that indicates we need to recreate the cache AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored; AND have a single CPU core; AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU; We can end up with a lockup. The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This should present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check. During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups. Then we move to LOOP_CACHING_WAIT state. During the this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever. There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again. Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-24 16:04:02 +02:00
Greg Kroah-Hartman	24a81a2c25	Merge 4.13-rc2 into char-misc-next We want the char/misc driver fixes in here as well to handle future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-07-23 19:58:30 -07:00
Linus Torvalds	505d5c1119	NFS client bugfixes for 4.13 Stable bugfixes: - Fix error reporting regression Bugfixes: - Fix setting filelayout ds address race - Fix subtle access bug when using ACLs - Fix setting mnt3_counts array size - Fix a couple of pNFS commit races -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAllyVYYACgkQ18tUv7Cl QOtTwBAA8ek0Ba5wIfwlQe4MvIX1t6v7q7Otrwdombhuw1a6nW410hFwu2SG8kGd fQWr1xnqRsMKwu83UuImtlYYvMa271TZgXxPNWx7KJpi/zocnigJRg5sVpkcRqza AjjPc245pHCooQoaTAlm5WrO9zDm0s7lRGuTB1cvPRsElnFVWT0jwhTASIDOU1Zy 9K/hElAnXp/dZv/ydjSePTPPsVQPJWLbJPlSm6vAIQPyeXWUeCgAym0yf2FVvtsd AQeozaq9xq6tofVPZhsYWeBKswjTHs3FxL8vhDOF9gF3QMPm43mwsfrKVidWo4vW 0UJnRZRCFgG0WIxxhA7l1Z9MovAsXlbWmFufgCa4Ev/bC5WuUT4ZEkjBGJw2vXD4 0/lxkhD41PBhCl/LIod9OT6iJ8koifl50JUC4N67D2illFy9a7Btzx3EPxfDz9uG 6jEek9x6B1xf7AC4HhxByN/E8gKX08N4Q/afxTFuAwrzKRKqI4Me3qbCyU86bp+T wiAxgVPVnmb/VBVLU68i7titdsA6U8ZO12FFqu9QOr9wHMXxa6108h2Nia9jVTdk EZhanXa6tJThQQ/QZicuR5hTtoM4BuikaaJSHZ4ODbgrAjsMAyqy4qsf4tf3LLAo tEDu5sJDmHJhhdtqzuz+OtDn5iJ2Nga6a4/fMwt9YU2ZQBm7X/8= =ZcQv -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client bugfixes from Anna Schumaker: "Stable bugfix: - Fix error reporting regression Bugfixes: - Fix setting filelayout ds address race - Fix subtle access bug when using ACLs - Fix setting mnt3_counts array size - Fix a couple of pNFS commit races" * tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid() NFS: Be more careful about mapping file permissions NFS: Store the raw NFS access mask in the inode's access cache NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask() NFS: Refactor NFS access to kernel access mask calculation net/sunrpc/xprt_sock: fix regression in connection error reporting. nfs: count correct array for mnt3_counts array size Revert commit `722f0b8911` ("pNFS: Don't send COMMITs to the DSes if...") pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit() NFS: Fix another COMMIT race in pNFS NFS: Fix a COMMIT race in pNFS mount: copy the port field into the cloned nfs_server structure. NFS: Don't run wake_up_bit() when nobody is waiting... nfs: add export operations	2017-07-21 16:26:01 -07:00
Linus Torvalds	99313414dd	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "This fixes a crash with SELinux and several other old and new bugs" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: check for bad and whiteout index on lookup ovl: do not cleanup directory and whiteout index entries ovl: fix xattr get and set with selinux ovl: remove unneeded check for IS_ERR() ovl: fix origin verification of index dir ovl: mark parent impure on ovl_link() ovl: fix random return value on mount	2017-07-21 16:24:22 -07:00
Trond Myklebust	1ebf980127	NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid() We must set fl->dsaddr once, and once only, even if there are multiple processes calling filelayout_check_deviceid() for the same layout segment. Reported-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 14:08:45 -04:00
Benjamin Coddington	3953704fde	locks: restore a warn for leaked locks on close When locks.c moved to using file_lock_context, the check for any locks that were not released was moved from the __fput() to destroy_inode() path in commit `8634b51f6c` ("locks: convert lease handling to file_lock_context"). This warning has been quite useful for catching bugs, particularly in NFS where lock handling still sees some churn. Let's bring back the warning for leaked locks on __fput, as this warning is much more likely to be seen and reported by users. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2017-07-21 13:57:31 -04:00
Trond Myklebust	ecbb903c56	NFS: Be more careful about mapping file permissions When mapping a directory, we want the MAY_WRITE permissions to reflect whether or not we have permission to modify, add and delete the directory entries. MAY_EXEC must map to lookup permissions. On the other hand, for files, we want MAY_WRITE to reflect a permission to modify and extend the file. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 11:51:19 -04:00
Trond Myklebust	bd8b244174	NFS: Store the raw NFS access mask in the inode's access cache Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 11:51:19 -04:00
Trond Myklebust	eda3e20847	NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 11:51:19 -04:00
Trond Myklebust	15d4b73ac2	NFS: Refactor NFS access to kernel access mask calculation Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 11:51:19 -04:00
Bob Peterson	4d7c18c7df	GFS2: Set gl_object in inode lookup only after block type check Before this patch, the inode glock's gl_object was set after a reference was acquired, but before the block type was verified. In cases where the block was unlinked, then freed and reused on another node, a residule delete callback (delete_work) would try to look up the inode, eventually failing the block check, but only after it overwrites gl_object with a pointer to the wrong inode. This patch moves the assignment of gl_object after the block check so it won't be improperly overwritten. Likewise, at the end of the function, gfs2_inode_lookup was clearing gl_object after it unlocked the glock, which meant another process might free the glock in the meantime. This patch guards against that case. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2017-07-21 08:20:42 -05:00
Bob Peterson	df3d87bde1	GFS2: Introduce helper for clearing gl_object This patch introduces a new helper function in glock.h that clears gl_object, with an added integrity check. An additional integrity check has been added to glock_set_object, plus comments. This is step 1 in a series to ensure gl_object integrity. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>	2017-07-21 08:20:05 -05:00
Eryu Guan	ecc7b435d2	nfs: count correct array for mnt3_counts array size Array size of mnt3_counts should be the size of array mnt3_procedures, not mnt_procedures, though they're same in size right now. Found this by code inspection. Fixes: `1c5876ddbd` ("sunrpc: move p_count out of struct rpc_procinfo") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-21 08:49:57 -04:00
Coly Li	e477b24b50	gfs2: add flag REQ_PRIO for metadata I/O When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of the bio. But flag REQ_META is just a hint for block trace, not for block layer code to handle a bio as metadata request. For some of metadata I/Os of gfs2, A REQ_PRIO flag on the metadata bio would be very informative to block layer code. For example, if bcache is used as a I/O cache for gfs2, it will be possible for bcache code to get the hint and cache the pre-fetched metadata blocks on cache device. This behavior may be helpful to improve metadata I/O performance if the following requests hit the cache. Here are the locations in gfs2 code where a REQ_PRIO flag should be added, - All places where REQ_READAHEAD is used, gfs2 code uses this flag for metadata read ahead. - In gfs2_meta_rq() where the first metadata block is read in. - In gfs2_write_buf_to_page(), read in quota metadata blocks to have them up to date. These metadata blocks are probably to be accessed again in future, adding a REQ_PRIO flag may have bcache to keep such metadata in fast cache device. For system without a cache layer, REQ_PRIO can still provide hint to block layer to handle metadata requests more properly. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-07-21 07:48:22 -05:00
Wang Xibo	e7cb550d79	GFS2: fix code parameter error in inode_go_lock In inode_go_lock() function, the parameter order of list_add() is error. According to the define of list_add(), the first parameter is new entry and the second is the list head, so ip->i_trunc_list should be the first parameter and the sdp->sd_trunc_list should be second. Signed-off-by: Wang Xibo<wang.xibo@zte.com.cn> Signed-off-by: Xiao Likun<xiao.likun@zte.com.cn> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-07-21 07:40:59 -05:00
David Howells	ddc6c70f07	rxrpc: Move the packet.h include file into net/rxrpc/ Move the protocol description header file into net/rxrpc/ and rename it to protocol.h. It's no longer necessary to expose it as packets are no longer exposed to kernel services (such as AFS) that use the facility. The abort codes are transferred to the UAPI header instead as we pass these back to userspace and also to kernel services. Signed-off-by: David Howells <dhowells@redhat.com>	2017-07-21 11:00:20 +01:00
Darrick J. Wong	10479e2dea	xfs: check _alloc_read_agf buffer pointer before using In some circumstances, _alloc_read_agf can return an error code of zero but also a null AGF buffer pointer. Check for this and jump out. Fixes-coverity-id: 1415250 Fixes-coverity-id: 1415320 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-07-20 14:42:33 -07:00
Darrick J. Wong	4c1a67bd36	xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write We must initialize the firstfsb parameter to _bmapi_write so that it doesn't incorrectly treat stack garbage as a restriction on which AGs it can search for free space. Fixes-coverity-id: 1402025 Fixes-coverity-id: 1415167 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-07-20 14:42:33 -07:00
Darrick J. Wong	1e86eabe73	xfs: check _btree_check_block value Check the _btree_check_block return value for the firstrec and lastrec functions, since we have the ability to signal that the repositioning did not succeed. Fixes-coverity-id: 114067 Fixes-coverity-id: 114068 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2017-07-20 14:42:32 -07:00
Linus Torvalds	791f2df39b	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem fixes from Jan Kara: "Several ACL related fixes for ext2, reiserfs, and hfsplus. And also one minor isofs cleanup" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: hfsplus: Don't clear SGID when inheriting ACLs isofs: Fix off-by-one in 'session' mount option parsing reiserfs: preserve i_mode if __reiserfs_set_acl() fails ext2: preserve i_mode if ext2_set_acl() fails ext2: Don't clear SGID when inheriting ACLs reiserfs: Don't clear SGID when inheriting ACLs	2017-07-20 10:41:12 -07:00
Linus Torvalds	465b0dbb38	for-f2fs-v4.13-rc2 We've filed some bug fixes: - missing f2fs case in terms of stale SGID big, introduced by Jan - build error for seq_file.h - avoid cpu lockup - wrong inode_unlock in error case -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAllwICsACgkQQBSofoJI UNItLQ/8CrPqw7pOSoH72n79/d5Md7tKe5TNN2qZbjCVGj7qs2opOnGM8hhFtUTe nFzK84evSpIQlgdRJFJU82E55U0coa3ySHgCQSUnHOobTtNsdmwq7p21/xT5LV3s 211zGYDgqtdp5/5ONHeD1ckF0QR9S9nWPuIRt9ef3bp2c7CfDrk+LLMrwSMeUlZo /uk5j32QPdME9ittqZ1bEZPl2FgwgmI4NFjyjGiHDK/ZYGhspHfa7FHjL8PW69UG pquiwlqHTg+i9wSc9byYALnJEs1XN6oW8E5TxO5zGqvfa77tQQb+qGHG9kYGDu64 JMpAXort5ZKNatkLLMXOoojLWutthv70f1IQK3eGUHhiWmsYrWZHjzrDh8hkcgh7 JMwGbYHrQlsAdk6B1r4MM8GW/telLufM3jTp7Fhpn1fLomWSE28JPtql9Ci5kIKX XxUF0y2HbC4ZI5LlY2umRzAfULaEFWEG/8X+wqTl3oE5Jv7Jthd69rpdjJvcQnPx iIz7J6BJopjAUoTUlXdSnWkP7VPkDOtDpAiu7cj16U39XSnIW/ceC+qLeP1J2R2c +hTg2pfYvh4eJGnNdxv4kZOxFFhjaEBReBPPgYOyCr7IPTtA+sucXO/zqWN6RH95 tu8+Efl60eQbCt2Gh+JlBR7hXNsgk56ksZ8XaYhBM4VRIWZFc/0= =nVpP -----END PGP SIGNATURE----- Merge tag 'for-f2fs-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "We've filed some bug fixes: - missing f2fs case in terms of stale SGID bit, introduced by Jan - build error for seq_file.h - avoid cpu lockup - wrong inode_unlock in error case" * tag 'for-f2fs-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: avoid cpu lockup f2fs: include seq_file.h for sysfs.c f2fs: Don't clear SGID when inheriting ACLs f2fs: remove extra inode_unlock() in error path	2017-07-20 10:30:16 -07:00
Amir Goldstein	0e082555ce	ovl: check for bad and whiteout index on lookup Index should always be of the same file type as origin, except for the case of a whiteout index. A whiteout index should only exist if all lower aliases have been unlinked, which means that finding a lower origin on lookup whose index is a whiteout should be treated as a lookup error. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-20 11:08:21 +02:00
Amir Goldstein	61b674710c	ovl: do not cleanup directory and whiteout index entries Directory index entries are going to be used for looking up redirected upper dirs by lower dir fh when decoding an overlay file handle of a merge dir. Whiteout index entries are going to be used as an indication that an exported overlay file handle should be treated as stale (i.e. after unlink of the overlay inode). We don't know the verification rules for directory and whiteout index entries, because they have not been implemented yet, so fail to mount overlay rw if those entries are found to avoid corrupting an index that was created by a newer kernel. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-20 11:08:21 +02:00
Miklos Szeredi	1d88f18373	ovl: fix xattr get and set with selinux inode_doinit_with_dentry() in SELinux wants to read the upper inode's xattr to get security label, and ovl_xattr_get() calls ovl_dentry_real(), which depends on dentry->d_inode, but d_inode is null and not initialized yet at this point resulting in an Oops. Fix by getting the upperdentry info from the inode directly in this case. Reported-by: Eryu Guan <eguan@redhat.com> Fixes: `09d8b58673` ("ovl: move __upperdentry to ovl_inode") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-07-20 11:08:21 +02:00
Trond Myklebust	213297369c	Revert commit `722f0b8911` ("pNFS: Don't send COMMITs to the DSes if...") Doing the test without taking any locks is racy, and so really it makes more sense to do it in the flexfiles code (which is the only case that cares). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-19 15:28:21 -04:00
Trond Myklebust	4b75053e9b	pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit() If the layout has expired due to a fencing event, then we should not attempt to commit to the DS. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-19 15:28:21 -04:00
Trond Myklebust	4118188645	NFS: Fix another COMMIT race in pNFS We must make sure that cinfo->ds->ncommitting is in sync with the commit list, since it is checked as part of pnfs_commit_list(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-19 15:28:21 -04:00
Trond Myklebust	e39928f942	NFS: Fix a COMMIT race in pNFS We must make sure that cinfo->ds->nwritten is in sync with the commit list, since it is checked as part of pnfs_scan_commit_lists(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-19 15:28:21 -04:00
Steve Dickson	89a6814d9b	mount: copy the port field into the cloned nfs_server structure. Doing this copy eliminates the "port=0" entry in the /proc/mounts entries Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=69241 Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-07-19 15:28:21 -04:00
Filipe Manana	e33bf72361	Btrfs: fix dir item validation when replaying xattr deletes We were passing an incorrect slot number to the function that validates directory items when we are replaying xattr deletes from a log tree. The correct slot is stored at variable 'i' and not at 'path->slots[0]', so the call to the validation function was only correct for the first iteration of the loop, when 'i == path->slots[0]'. After this fix, the fstest generic/066 passes again. Fixes: `8ee8c2d62d` ("btrfs: Verify dir_item in replay_xattr_deletes") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-19 20:38:16 +02:00
Andreas Gruenbacher	98e5a91a61	gfs2: Fixup to "Get rid of flush_delayed_work in gfs2_evict_inode" When commit `4fd1a57952` moved the call to flush_delayed_work from gfs2_evict_inode to gfs2_inode_lookup to avoid calling into DLM during evict, a similar call should have been added to gfs2_create_inode: that's another code path in which glocks of previous inodes may be reused. The flush of the iopen glock work queue added by `4fd1a57952`, on the other hand, is unnecessary and can be removed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-07-19 11:10:19 -05:00
Jan Kara	914cea93dd	gfs2: Don't clear SGID when inheriting ACLs When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __gfs2_set_acl() into gfs2_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: `073931017b` Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-07-19 10:58:54 -05:00
Linus Torvalds	e06fdaf40a	Now that IPC and other changes have landed, enable manual markings for randstruct plugin, including the task_struct. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Kees Cook <kees@outflux.net> iQIcBAABCgAGBQJZbRgGAAoJEIly9N/cbcAmk2AQAIL60aQ+9RIcFAXriFhnd7Z2 x9Jqi9JNc8NgPFXx8GhE4J4eTZ5PwcjgXBpNRWY/laBkRyoBHn24ku09YxrJjmHz ZSUsP+/iO9lVeEfbmU9Tnk50afkfwx6bHXBwkiVGQWHtybNVUqA19JbqkHeg8ubx myKLGeUv5PPCodRIcBDD0+HaAANcsqtgbDpgmWU8s+IXWwvWCE2p7PuBw7v3HHgH qzlPDHYQCRDw+LWsSqPaHj+9mbRO18P/ydMoZHGH4Hl3YYNtty8ZbxnraI3A7zBL 6mLUVcZ+/l88DqHc5I05T8MmLU1yl2VRxi8/jpMAkg9wkvZ5iNAtlEKIWU6eqsvk vaImNOkViLKlWKF+oUD1YdG16d8Segrc6m4MGdI021tb+LoGuUbkY7Tl4ee+3dl/ 9FM+jPv95HjJnyfRNGidh2TKTa9KJkh6DYM9aUnktMFy3ca1h/LuszOiN0LTDiHt k5xoFURk98XslJJyXM8FPwXCXiRivrXMZbg5ixNoS4aYSBLv7Cn1M6cPnSOs7UPh FqdNPXLRZ+vabSxvEg5+41Ioe0SHqACQIfaSsV5BfF2rrRRdaAxK4h7DBcI6owV2 7ziBN1nBBq2onYGbARN6ApyCqLcchsKtQfiZ0iFsvW7ZawnkVOOObDTCgPl3tdkr 403YXzphQVzJtpT5eRV6 =ngAW -----END PGP SIGNATURE----- Merge tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull structure randomization updates from Kees Cook: "Now that IPC and other changes have landed, enable manual markings for randstruct plugin, including the task_struct. This is the rest of what was staged in -next for the gcc-plugins, and comes in three patches, largest first: - mark "easy" structs with __randomize_layout - mark task_struct with an optional anonymous struct to isolate the __randomize_layout section - mark structs to opt _out_ of automated marking (which will come later) And, FWIW, this continues to pass allmodconfig (normal and patched to enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and s390 for me" * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: randstruct: opt-out externally exposed function pointer structs task_struct: Allow randomized layout randstruct: Mark various structs for randomization	2017-07-19 08:55:18 -07:00
Linus Torvalds	a90c6ac2b5	A number of small fixes for -rc1 Luminous changes plus a readdir race fix, marked for stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABCAAGBQJZb3AkAAoJEEp/3jgCEfOLMAUH/RRRxbY4KL/PUhDXVPf+a+Pf groC365undvuCmHCkT1ufrlrh56KE0XUvEKgXJp+r84WS4SC6lxaebD6QvzVtyMM KPVnbpCNfKw5KtLB1upMteYY6MGfTk4VTPCav69aNGPrvUxJQB8obvWenPi0rWk/ knALvlJZbSiZeUDK3Id9cjntTGkClYuUHYJQ1JaZeieB/Xwnr+ZvV4on8ul7gkGX B6zdqaM43ZomSl/rJrV/G/MOMNV5uVjBNJmVpfH7KkZQGipW7O+8aDwFaMFAAN7r 4TQcLf+d3SDjcjVspikCMYr0r0VnbL8hLPGkd7Cus/3jei9GWQHGaQqbZZmcKl8= =TPyV -----END PGP SIGNATURE----- Merge tag 'ceph-for-4.13-rc2' of git://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "A number of small fixes for -rc1 Luminous changes plus a readdir race fix, marked for stable" * tag 'ceph-for-4.13-rc2' of git://github.com/ceph/ceph-client: libceph: potential NULL dereference in ceph_msg_data_create() ceph: fix race in concurrent readdir libceph: don't call encode_request_finish() on MOSDBackoff messages libceph: use alloc_pg_mapping() in __decode_pg_upmap_items() libceph: set -EINVAL in one place in crush_decode() libceph: NULL deref on osdmap_apply_incremental() error path libceph: fix old style declaration warnings	2017-07-19 08:49:46 -07:00
Ernesto A. Fernández	f070e5ac9b	jfs: preserve i_mode if __jfs_set_acl() fails When changing a file's acl mask, __jfs_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by only changing the inode mode after the acl has been set. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2017-07-18 14:28:06 -05:00
Jan Kara	9bcf66c72d	jfs: Don't clear SGID when inheriting ACLs When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __jfs_set_acl() into jfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: `073931017b` CC: stable@vger.kernel.org CC: jfs-discussion@lists.sourceforge.net Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2017-07-18 14:28:03 -05:00
Linus Torvalds	15b0a8d1d4	One fix for a problem introduced in the most recent merge window and found by Dave Jones and KASAN. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZbhjwAAoJECebzXlCjuG+kx4P/3iBtL7+dBIcz3Lj6oetbOM0 6myybTmC6ZWLN9cmNN6GIFk82TUYvo4EQRAOt7GcSdF6X5GtMM0LNMMaquAxM/gc ZwCeQ0K33PKL3qIlaOF1DYT/DN3uwVdRl+VfueMzQjOAfm7kMtQaa1xjaGJtqT13 Vssn5S4LWIo9SaXFp9/kgnPZF9kIl1v8Rfwowhn8g90KNChcKnMSi4ia0jrMPK/w VLuOr8VLPlUa3GEKmcwMQqf+0WPryvVtahHIblNVTHyFeYP/j5u7TCp3T5GoechR /Rk2MPzPMtNk850QkrwEegYuJIRsMvrhSMaV/FOZCNKi728haP9ruOSQryMRKUBt BJMJrE+0hclJ2KPPbjp1cp2P1fn+dZbt+V22VGRD1ZTgKQKj7+DgSfLxUgA6/X4L 4QjfmO5N6suLfl8XaXyncliQxFoyWWV4s+FKQDjk5/dZ0/PZt1zBZ3cBo2Bxnjgj 1yfeVM8zyrnIAJv04U59fQnXaaqgkk0NmOllpTN4exbyHqq7U4wz9gpjFulbBn5X p6ebs6RKdRolfalpTyaRvdDcg8zmSwyeoHpzc3FHr7AGhX0edOXR8b+vWhalFB7T B1Jnoy/qesGjafB7BXgLtaZQO6wMoQv3JdTDrxDlRIXJQfULeR0Zw5kosNzgaBKY B1TSX5Tiit9W7+DxAfz+ =QSoH -----END PGP SIGNATURE----- Merge tag 'nfsd-4.13-1' of git://linux-nfs.org/~bfields/linux Pull nfsd fix from Bruce Fields: "One fix for a problem introduced in the most recent merge window and found by Dave Jones and KASAN" * tag 'nfsd-4.13-1' of git://linux-nfs.org/~bfields/linux: nfsd: Fix a memory scribble in the callback channel	2017-07-18 11:11:13 -07:00
Jan Kara	84969465dd	hfsplus: Don't clear SGID when inheriting ACLs When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by creating __hfsplus_set_posix_acl() function that does not call posix_acl_update_mode() and use it when inheriting ACLs. That prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: `073931017b` CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2017-07-18 18:23:39 +02:00
Jan Kara	34363c057b	isofs: Fix off-by-one in 'session' mount option parsing According to ECMA-130 standard maximum valid track number is 99. Since 'session' mount option starts indexing at 0 (and we add 1 to the passed number), we should refuse value 99. Also the condition in isofs_get_last_session() unnecessarily repeats the check - remove it. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2017-07-18 12:33:16 +02:00

... 5 6 7 8 9 ...

50650 Commits