OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Li RongQing	e9f456ca50	nfs: define nfs_inc_fscache_stats and using it as possible Define and use nfs_inc_fscache_stats when plus one, which can save to pass one parameter. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 20:08:47 -05:00
Li RongQing	5a254d08b0	nfs: replace nfs_add_stats with nfs_inc_stats when add one Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 20:08:47 -05:00
Markus Elfring	fe0bf1185d	NFS: Deletion of unnecessary checks before the function call "nfs_put_client" The nfs_put_client() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 20:07:27 -05:00
Jeff Layton	10b89567db	lockd: eliminate LOCKD_DEBUG LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can just use it instead. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:24:08 -05:00
Peng Tao	4bd5a980de	nfs41: fix nfs4_proc_layoutget error handling nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt early so that it is safe to call .release in case nfs4_alloc_pages fails. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Fixes: `a47970ff78` ("NFSv4.1: Hold reference to layout hdr in layoutget") Cc: stable@vger.kernel.org # 3.9+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:14:54 -05:00
Weston Andros Adamson	cb1410c71e	NFS: fix subtle change in COMMIT behavior Recent work in the pgio layer made it possible for there to be more than one request per page. This caused a subtle change in commit behavior, because write.c:nfs_commit_unstable_pages compares the number of pages waiting for writeback against the number of requests on a commit list to choose when to send a COMMIT in a non-blocking flush. This is probably hard to hit in normal operation - you have to be using rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page aligned to have a noticeable change in behavior. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:00:42 -05:00
Christoph Hellwig	6a74c0c940	pnfs/blocklayout: fix end calculation in pnfs_num_cont_bytes Use the number of pages in the pagecache mapping instead of the number of pnfs requests which is only slightly related. Reported-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:00:41 -05:00
Anna Schumaker	878ffa9f85	NFS: Use nfs_server_capable() for checknig NFS_CAP_SEEK This should make the code easier to maintain in the future. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 12:49:13 -05:00
Jan Kara	c1b69b1ca1	nfs: Remove dead case from nfs4_map_errors() NFS4ERR_ACCESS has number 13 and thus is matched and returned immediately at the beginning of nfs4_map_errors() and there's no point in checking it later. Coverity-id: 733891 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 12:48:14 -05:00
Paul Burton	774c105ed8	binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers MIPS is introducing new variants of its O32 ABI which differ in their handling of floating point, in order to enable a gradual transition towards a world where mips32 binaries can take advantage of new hardware features only available when configured for certain FP modes. In order to do this ELF binaries are being augmented with a new section that indicates, amongst other things, the FP mode requirements of the binary. The presence & location of such a section is indicated by a program header in the PT_LOPROC ... PT_HIPROC range. In order to allow the MIPS architecture code to examine the program header & section in question, pass all program headers in this range to an architecture-specific arch_elf_pt_proc function. This function may return an error if the header is deemed invalid or unsuitable for the system, in which case that error will be returned from load_elf_binary and upwards through the execve syscall. A means is required for the architecture code to make a decision once it is known that all such headers have been seen, but before it is too late to return from an execve syscall. For this purpose the arch_check_elf function is added, and called once, after all PT_LOPROC to PT_HIPROC headers have been passed to arch_elf_pt_proc but before the code which invoked execve has been lost. This enables the architecture code to make a decision based upon all the headers present in an ELF binary and its interpreter, as is required to forbid conflicting FP ABI requirements between an ELF & its interpreter. In order to allow data to be stored throughout the calls to the above functions, struct arch_elf_state is introduced. Finally a variant of the SET_PERSONALITY macro is introduced which accepts a pointer to the struct arch_elf_state, allowing it to act based upon state observed from the architecture specific program headers. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7679/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2014-11-24 07:45:02 +01:00
Paul Burton	a9d9ef133f	binfmt_elf: load interpreter program headers earlier Load the program headers of an ELF interpreter early enough in load_elf_binary that they can be examined before it's too late to return an error from an exec syscall. This patch does not perform any such checking, it merely lays the groundwork for a further patch to do so. No functional change is intended. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7675/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2014-11-24 07:45:02 +01:00
Paul Burton	6a8d38945c	binfmt_elf: Hoist ELF program header loading to a function load_elf_binary & load_elf_interp both load program headers from an ELF executable in the same way, duplicating the code. This patch introduces a helper function (load_elf_phdrs) which performs this common task & calls it from both load_elf_binary & load_elf_interp. In addition to reducing code duplication, this is part of preparing to load the ELF interpreter headers earlier such that they can be examined before it's too late to return an error from an exec syscall. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7676/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2014-11-24 07:45:02 +01:00
Jaegeuk Kim	0341845efc	f2fs: fix livelock calling f2fs_iget during f2fs_evict_inode In f2fs_evict_inode, commit_inmemory_pages f2fs_gc f2fs_iget iget_locked -> wait for inode free Here, if the inode is same as the one to be evicted, f2fs should wait forever. Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid this. But, the commit_inmem_pages calls f2fs_balance_fs by default, even if f2fs_evict_inode wants to free inmemory pages only. Hence, this patch adds to trigger f2fs_balance_fs only when there is something to write. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-23 21:51:57 -08:00
Jaegeuk Kim	9486ba442b	f2fs: introduce f2fs_dentry_kunmap to clean up This patch introduces f2fs_dentry_kunmap to clean up dirty codes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-23 21:51:53 -08:00
Changman Lee	c9ee00857c	f2fs: fix wrong data structure when create slab It used nat_entry_set when create slab for sit_entry_set. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-23 21:48:49 -08:00
Jaegeuk Kim	09b8b3c839	f2fs: call flush_dcache_page when the page was updated Whenever f2fs updates mapped pages, it needs to call flush_dcache_page. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-23 21:48:31 -08:00
Linus Torvalds	d038a63ace	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs deadlock fix from Chris Mason: "This has a fix for a long standing deadlock that we've been trying to nail down for a while. It ended up being a bad interaction with the fair reader/writer locks and the order btrfs reacquires locks in the btree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix lockups from btrfs_clear_path_blocking	2014-11-23 11:16:36 -08:00
Eric Whitney	0756b908a3	ext4: fix end of region partial cluster handling ext4_ext_remove_space() can incorrectly free a partial_cluster if EAGAIN is encountered while truncating or punching. Extent removal should be retried in this case. It also fails to free a partial cluster when the punched region begins at the start of a file on that unaligned cluster and where the entire file has not been punched. Remove the requirement that all blocks in the file must have been freed in order to free the partial cluster. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-23 00:59:39 -05:00
Eric Whitney	345ee94748	ext4: miscellaneous partial cluster cleanups Add some casts and rearrange a few statements for improved readability. Some code can also be simplified and made more readable if we set partial_cluster to 0 rather than to a negative value when we can tell we've hit the left edge of the punched region. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-23 00:59:39 -05:00
Eric Whitney	5bf4376065	ext4: fix end of leaf partial cluster handling The fix in commit `ad6599ab3a` ("ext4: fix premature freeing of partial clusters split across leaf blocks"), intended to avoid dereferencing an invalid extent pointer when determining whether a partial cluster should be freed, wasn't quite good enough. Assure that at least one extent remains at the start of the leaf once the hole has been punched. Otherwise, the pointer to the extent to the right of the hole will be invalid and a partial cluster will be incorrectly freed. Set partial_cluster to 0 when we can tell we've hit the left edge of the punched region within the leaf. This prevents incorrect freeing of a partial cluster when ext4_ext_rm_leaf is called one last time during extent tree traversal after the punched region has been removed. Adjust comments to reflect code changes and a correction. Remove a bit of dead code. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-23 00:58:11 -05:00
Eric Whitney	f4226d9ea4	ext4: fix partial cluster initialization The partial_cluster variable is not always initialized correctly when hole punching on bigalloc file systems. Although commit `c063449394` ("ext4: fix partial cluster handling for bigalloc file systems") addressed the case where the right edge of the punched region and the next extent to its right were within the same leaf, it didn't handle the case where the next extent to its right is in the next leaf. This causes xfstest generic/300 to fail. Fix this by replacing the code in c0634493922 with a more general solution that can continue the search for the first cluster to the right of the punched region into the next leaf if present. If found, partial_cluster is initialized to this cluster's negative value. There's no need to determine if that cluster is actually shared; we simply record it so its blocks won't be freed in the event it does happen to be shared. Also, minimize the burden on non-bigalloc file systems with some minor code simplification. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-23 00:55:42 -05:00
Filipe Manana	b38ef71cb1	Btrfs: ensure ordered extent errors aren't missed on fsync When doing a fsync with a fast path we have a time window where we can miss the fact that writeback of some file data failed, and therefore we endup returning success (0) from fsync when we should return an error. The steps that lead to this are the following: 1) We start all ordered extents by calling filemap_fdatawrite_range(); 2) We do some other work like locking the inode's i_mutex, start a transaction, start a log transaction, etc; 3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the ordered extents from inode's ordered tree into a list; 4) But by the time we do ordered extent collection, some ordered extents we started at step 1) might have already completed with an error, and therefore we didn't found them in the ordered tree and had no idea they finished with an error. This makes our fsync return success (0) to userspace, but has no bad effects on the log like for example insertion of file extent items into the log that point to unwritten extents, because the invalid extent maps were removed before the ordered extent completed (in inode.c:btrfs_finish_ordered_io). So after collecting the ordered extents just check if the inode's i_mapping has any error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever writeback fails for a page of an ordered extent, we call mapping_set_error (done in extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage) that sets one of those error flags in the inode's i_mapping flags. This change also has the side effect of fixing the issue where for fast fsyncs we never checked/cleared the error flags from the inode's i_mapping flags, which means that a full fsync performed after a fast fsync could get such errors that belonged to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which calls filemap_fdatawait_range(), and the later checks for and clears those flags, while for fast fsyncs we never call filemap_fdatawait_range() or anything else that checks for and clears the error flags from the inode's i_mapping. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-21 11:59:57 -08:00
Filipe Manana	0870295b23	Btrfs: collect only the necessary ordered extents on ranged fsync Instead of collecting all ordered extents from the inode's ordered tree and then wait for all of them to complete, just collect the ones that overlap the fsync range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-21 11:59:56 -08:00
Filipe Manana	5ab5e44a36	Btrfs: don't ignore log btree writeback errors If an error happens during writeback of log btree extents, make sure the error is returned to the caller (fsync), so that it takes proper action (commit current transaction) instead of writing a superblock that points to log btrees with all or some nodes that weren't durably persisted. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-21 11:59:55 -08:00
Josef Bacik	a28046956c	Btrfs: do not move em to modified list when unpinning We use the modified list to keep track of which extents have been modified so we know which ones are candidates for logging at fsync() time. Newly modified extents are added to the list at modification time, around the same time the ordered extent is created. We do this so that we don't have to wait for ordered extents to complete before we know what we need to log. The problem is when something like this happens log extent 0-4k on inode 1 copy csum for 0-4k from ordered extent into log sync log commit transaction log some other extent on inode 1 ordered extent for 0-4k completes and adds itself onto modified list again log changed extents see ordered extent for 0-4k has already been logged at this point we assume the csum has been copied sync log crash On replay we will see the extent 0-4k in the log, drop the original 0-4k extent which is the same one that we are replaying which also drops the csum, and then we won't find the csum in the log for that bytenr. This of course causes us to have errors about not having csums for certain ranges of our inode. So remove the modified list manipulation in unpin_extent_cache, any modified extents should have been added well before now, and we don't want them re-logged. This fixes my test that I could reliably reproduce this problem with. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-21 11:59:54 -08:00
Josef Bacik	50d9aa99bd	Btrfs: make sure logged extents complete in the current transaction V3 Liu Bo pointed out that my previous fix would lose the generation update in the scenario I described. It is actually much worse than that, we could lose the entire extent if we lose power right after the transaction commits. Consider the following write extent 0-4k log extent in log tree commit transaction < power fail happens here ordered extent completes We would lose the 0-4k extent because it hasn't updated the actual fs tree, and the transaction commit will reset the log so it isn't replayed. If we lose power before the transaction commit we are save, otherwise we are not. Fix this by keeping track of all extents we logged in this transaction. Then when we go to commit the transaction make sure we wait for all of those ordered extents to complete before proceeding. This will make sure that if we lose power after the transaction commit we still have our data. This also fixes the problem of the improperly updated extent generation. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-21 11:58:32 -08:00
Al Viro	3035b675ad	Merge branch 'overlayfs-current' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus "The biggest change is to rename the filesystem from "overlayfs" to "overlay". This will allow legacy overlayfs to be easily carried by distros alongside the new mainline one. Also fix a couple of copy-up races and allow escaping comma character in filenames." The last bit is about commas in pathname mount options...	2014-11-21 11:51:08 -05:00
Josef Bacik	9dba8cf128	Btrfs: make sure we wait on logged extents when fsycning two subvols If we have two fsync()'s race on different subvols one will do all of its work to get into the log_tree, wait on it's outstanding IO, and then allow the log_tree to finish it's commit. The problem is we were just free'ing that subvols logged extents instead of waiting on them, so whoever lost the race wouldn't really have their data on disk. Fix this by waiting properly instead of freeing the logged extents. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:10 -08:00
David Sterba	0d95c1bec9	btrfs: fix wrong accounting of raid1 data profile in statfs The sizes that are obtained from space infos are in raw units and have to be adjusted according to the raid factor. This was missing for f_bavail and df reported doubled size for raid1. Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Fixes: `ba7b6e62f4` ("btrfs: adjust statfs calculations according to raid profiles") CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:09 -08:00
Gui Hecheng	321592427c	btrfs: fix dead lock while running replace and defrag concurrently This can be reproduced by fstests: btrfs/070 The scenario is like the following: replace worker thread defrag thread --------------------- ------------- copy_nocow_pages_worker btrfs_defrag_file copy_nocow_pages_for_inode ... btrfs_writepages \|A\| lock_extent_bits extent_write_cache_pages \|B\| lock_page __extent_writepage ... writepage_delalloc find_lock_delalloc_range \|B\| lock_extent_bits find_or_create_page pagecache_get_page \|A\| lock_page This leads to an ABBA pattern deadlock. To fix it, o we just change it to an AABB pattern which means to @unlock_extent_bits() before we @lock_page(), and in this way the @extent_read_full_page_nolock() is no longer in an locked context, so change it back to @extent_read_full_page() to regain protection. o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(), the extent may not really point at the physical block we want, so we have to check it before write. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:08 -08:00
Filipe Manana	5f5bc6b1e2	Btrfs: make xattr replace operations atomic Replacing a xattr consists of doing a lookup for its existing value, delete the current value from the respective leaf, release the search path and then finally insert the new value. This leaves a time window where readers (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs, so this has security implications. This change also fixes 2 other existing issues which were: ) Deleting the old xattr value without verifying first if the new xattr will fit in the existing leaf item (in case multiple xattrs are packed in the same item due to name hash collision); ) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't exist but we have have an existing item that packs muliple xattrs with the same name hash as the input xattr. In this case we should return ENOSPC. A test case for xfstests follows soon. Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace implementation. Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:07 -08:00
Filipe Manana	c7bc6319c5	Btrfs: avoid premature -ENOMEM in clear_extent_bit() We try to allocate an extent state structure before acquiring the extent state tree's spinlock as we might need a new one later and therefore avoid doing later an atomic allocation while holding the tree's spinlock. However we returned -ENOMEM if that initial non-atomic allocation failed, which is a bit excessive since we might end up not needing the pre-allocated extent state at all - for the case where the tree doesn't have any extent states that cover the input range and cover too any other range. Therefore don't return -ENOMEM if that pre-allocation fails. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:06 -08:00
Josef Bacik	7e33fd993a	Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2 Our gluster boxes get several thousand statfs() calls per second, which begins to suck hardcore with all of the lock contention on the chunk mutex and dev list mutex. We don't really need to hold these things, if we have transient weirdness with statfs() because of the chunk allocator we don't care, so remove this locking. We still need the dev_list lock if you mount with -o alloc_start however, which is a good argument for nuking that thing from orbit, but that's a patch for another day. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:05 -08:00
Josef Bacik	633c0aad4c	Btrfs: move read only block groups onto their own list V2 Our gluster boxes were spending lots of time in statfs because our fs'es are huge. The problem is statfs loops through all of the block groups looking for read only block groups, and when you have several terabytes worth of data that ends up being a lot of block groups. Move the read only block groups onto a read only list and only proces that list in btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:04 -08:00
David Sterba	cd743fac42	btrfs: fix typos in btrfs_check_super_valid Copy&paste errors in some messages and add few more missing macro accessors. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:03 -08:00
Stefan Behrens	cf90c59e68	Btrfs: check-int: don't complain about balanced blocks The xfstest btrfs/014 which tests the balance operation caused that the check_int module complained that known blocks changed their physical location. Since this is not an error in this case, only print such message if the verbose mode was enabled. Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:30 -08:00
Stefan Behrens	f382e4653f	Btrfs: check_int: use the known block location The xfstest btrfs/014 which tests the balance operation caused issues with the check_int module. The attempt was made to use btrfs_map_block() to find the physical location for a written block. However, this was not at all needed since the location of the written block was known since a hook to submit_bio() was the reason for entering the check_int module. Additionally, after a block relocation it happened that btrfs_map_block() failed causing misleading error messages afterwards. This patch changes the check_int module to use the known information of the physical location from the bio. Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Filipe Manana	c8fd3de79f	Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early We try to allocate an extent state before acquiring the tree's spinlock just in case we end up needing to split an existing extent state into two. If that allocation failed, we would return -ENOMEM. However, our only single caller (transaction/log commit code), passes in an extent state that was cached from a call to find_first_extent_bit() and that has a very high chance to match exactly the input range (always true for a transaction commit and very often, but not always, true for a log commit) - in this case we end up not needing at all that initial extent state used for an eventual split. Therefore just don't return -ENOMEM if we can't allocate the temporary extent state, since we might not need it at all, and if we end up needing one, we'll do it later anyway. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Filipe Manana	e38e2ed701	Btrfs: make find_first_extent_bit be able to cache any state Right now the only caller of find_first_extent_bit() that is interested in caching extent states (transaction or log commit), never gets an extent state cached. This is because find_first_extent_bit() only caches states that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and the transaction/log commit caller always passes a tree that doesn't have ever extent states with any of those flags (they can only have one of the following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT). This change together with the following one in the patch series (titled "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will help reduce significantly the chances of calls to convert_extent_bit() fail with -ENOMEM when called from the transaction/log commit code. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Filipe Manana	663dfbb077	Btrfs: deal with convert_extent_bit errors to avoid fs corruption When committing a transaction or a log, we look for btree extents that need to be durably persisted by searching for ranges in a io tree that have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear those bits and set the EXTENT_NEED_WAIT bit, with calls to the function convert_extent_bit, and then start writeback for the extents. That function however can return an error (at the moment only -ENOMEM is possible, specially when it does GFP_ATOMIC allocation requests through alloc_extent_state_atomic) - that means the ranges didn't got the EXTENT_NEED_WAIT bit set (or at least not for the whole range), which in turn means a call to btrfs_wait_marked_extents() won't find those ranges for which we started writeback, causing a transaction commit or a log commit to persist a new superblock without waiting for the writeback of extents in that range to finish first. Therefore if a crash happens after persisting the new superblock and before writeback finishes, we have a superblock pointing to roots that weren't fully persisted or roots that point to nodes or leafs that weren't fully persisted, causing all sorts of unexpected/bad behaviour as we endup reading garbage from disk or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like "parent transid verify failed on X wanted Y found Z" when reading btree nodes/leafs from disk). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Eryu Guan	2fc9f6baa2	Btrfs: return failure if btrfs_dev_replace_finishing() failed device replace could fail due to another running scrub process or any other errors btrfs_scrub_dev() may hit, but this failure doesn't get returned to userspace. The following steps could reproduce this issue mkfs -t btrfs -f /dev/sdb1 /dev/sdb2 mount /dev/sdb1 /mnt/btrfs while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done & btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs # if this replace succeeded, do the following and repeat until # you see this log in dmesg # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115 #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs # once you see the error log in dmesg, check return value of # replace echo $? Introduce a new dev replace result BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to catch -EINPROGRESS explicitly and return other errors directly to userspace. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:28 -08:00
Shilong Wang	6b3a4d60db	Btrfs: fix allocationg memory failure for btrfsic_state structure size of @btrfsic_state needs more than 2M, it is very likely to fail allocating memory using kzalloc(). see following mesage: [91428.902148] Call Trace: [<ffffffff816f6e0f>] dump_stack+0x4d/0x66 [<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170 [<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30 [<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0 [<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0 [<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0 [<ffffffff811d1018>] kmalloc_order+0x18/0x50 [<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140 [<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs] [<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0 [<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430 [<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs] [<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs] [<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0 [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50 [<ffffffff81230429>] mount_fs+0x39/0x1b0 [<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150 [<ffffffff812537fb>] do_mount+0x27b/0xc30 [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50 [<ffffffff812544f6>] SyS_mount+0x96/0xf0 [<ffffffff81701970>] system_call_fastpath+0x16/0x1b Since we are allocating memory for hash table array, so it will be good if we could allocate continuous pages here. Fix this problem by firstly trying kzalloc(), if we fail, use vzalloc() instead. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:28 -08:00
Filipe Manana	e6eb43142a	Btrfs: report error after failure inlining extent in compressed write path If cow_file_range_inline() failed, when called from compress_file_range(), we were tagging the locked page for writeback, end its writeback and unlock it, but not marking it with an error nor setting AS_EIO in inode's mapping flags. This made it impossible for a caller of filemap_fdatawrite_range (writepages) or filemap_fdatawait_range() to know that an error happened. And the return value of compress_file_range() is useless because it's returned to a workqueue task and not to the task calling filemap_fdatawrite_range (writepages). This change applies on top of the previous patchset starting at the patch titled: "[1/5] Btrfs: set page and mapping error on compressed write failure" Which changed extent_clear_unlock_delalloc() to use SetPageError and mapping_set_error(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:28 -08:00
Filipe Manana	728404dacf	Btrfs: add helper btrfs_fdatawrite_range To avoid duplicating this double filemap_fdatawrite_range() call for inodes with async extents (compressed writes) so often. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:28 -08:00
Filipe Manana	075bdbdbe9	Btrfs: correctly flush compressed data before/after direct IO For compressed writes, after doing the first filemap_fdatawrite_range() we don't get the pages tagged for writeback immediately. Instead we create a workqueue task, which is run by other kthread, and keep the pages locked. That other kthread compresses data, creates the respective ordered extent/s, tags the pages for writeback and unlocks them. Therefore we need a second call to filemap_fdatawrite_range() if we have compressed writes, as this second call will wait for the pages to become unlocked, then see they became tagged for writeback and finally wait for the writeback to finish. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:27 -08:00
Filipe Manana	c44f649e28	Btrfs: make inode.c:compress_file_range() return void Its return value is useless, its single caller ignores it and can't do anything with it anyway, since it's a workqueue task and not the task calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range(). Failure is communicated to such functions via start and end of writeback with the respective pages tagged with an error and AS_EIO flag set in the inode's imapping. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:27 -08:00
Shilong Wang	4bcbb33255	Btrfs: fix incorrect compression ratio detection Steps to reproduce: # mkfs.btrfs -f /dev/sdb # mount -t btrfs /dev/sdb /mnt -o compress=lzo # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1 after previous steps, inode will be detected as bad compression ratio, and NOCOMPRESS flag will be set for that inode. Reason is that compress have a max limit pages every time(128K), if a 132k write in, it will be splitted into two write(128k+4k), this bug is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write) Fix this problem by checking every time before compression, if it is a small write(<=blocksize), we bail out and fall into nocompression directly. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:27 -08:00
Filipe Manana	7bdcefc103	Btrfs: don't ignore compressed bio write errors Our compressed bio write end callback was essentially ignoring the error parameter. When a write error happens, it must pass a value of 0 to the inode's write_page_end_io_hook callback, SetPageError on the respective pages and set AS_EIO in the inode's mapping flags, so that a call to filemap_fdatawait_range() / filemap_fdatawait() can find out that errors happened (we surely don't want silent failures on fsync for example). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:26 -08:00
Filipe Manana	dec8f17563	Btrfs: make inode.c:submit_compressed_extents() return void Its return value is completely ignored by its single caller and it's useless anyway, since errors are indicated through SetPageError and the bit AS_EIO set in the flags of the inode's mapping. The caller can't do anything with the value, as it's invoked from a workqueue task and not by the task calling filemap_fdatawrite_range (which calls the writepages address space callback, which in turn calls the inode's fill_delalloc callback). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:26 -08:00
Filipe Manana	3d7a820f71	Btrfs: process all async extents on compressed write failure If we had an error when processing one of the async extents from our list, we were not processing the remaining async extents, meaning we would leak those async_extent structs, never release the pages with the compressed data and never unlock and clear the dirty flag from the inode's pages (those that correspond to the uncompressed content). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:26 -08:00
Filipe Manana	40ae837b43	Btrfs: don't leak pages and memory on compressed write error In inode.c:submit_compressed_extents(), if we fail before calling btrfs_submit_compressed_write(), or when that function fails, we were freeing the async_extent structure without releasing its pages and freeing the pages array. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:26 -08:00
Filipe Manana	fce2a4e6b2	Btrfs: fix hang on compressed write error In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write() we start writeback for all pages, clear their dirty flag, unlock them, etc, but if btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM), we never end the writeback on the pages, so any filemap_fdatawait_range() call will hang forever. We were also not calling the writepage end io hook, which means the corresponding ordered extent will never complete and all its waiters will block forever, such as a full fsync (via btrfs_wait_ordered_range()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:25 -08:00
Filipe Manana	704de49d2b	Btrfs: set page and mapping error on compressed write failure If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(), we start and end the writeback for the pages (clear their dirty flag, unlock them, etc) but we don't tag the pages, nor the inode's mapping, with an error. This makes it impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit for e.g.) know that there was an error. Note that the return value of submit_compressed_extents() is useless, as that function is executed by a workqueue task and not directly by the fill_delalloc callback. This means the writepage/s callbacks of the inode's address space operations don't get that return value. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:25 -08:00
Al Viro	b93b41d4c7	ext4: kill ext4_kvfree() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-20 12:19:11 -05:00
Miklos Szeredi	7676895f47	ovl: ovl_dir_fsync() cleanup Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE. For a copied up directory the two are currently equivalent. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:40:02 +01:00
Miklos Szeredi	c9f00fdb9a	ovl: pass dentry into ovl_dir_read_merged() Pass dentry into ovl_dir_read_merged() insted of upperpath and lowerpath. This cleans up callers and paves the way for multi-layer directory reads. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:40:01 +01:00
Miklos Szeredi	71d509280f	ovl: use lockless_dereference() for upperdentry Don't open code lockless_dereference() in ovl_upperdentry_dereference(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:40:01 +01:00
Miklos Szeredi	91c7794713	ovl: allow filenames with comma Allow option separator (comma) to be escaped with backslash. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:40:00 +01:00
Miklos Szeredi	521484639e	ovl: fix race in private xattr checks Xattr operations can race with copy up. This does not matter as long as we consistently fiter out "trunsted.overlay.opaque" attribute on upper directories. Previously we checked parent against OVL_PATH_MERGE. This is too general, and prone to race with copy-up. I.e. we found the parent to be on the lower layer but ovl_dentry_real() would return the copied-up dentry, possibly with the "opaque" attribute. So instead use ovl_path_real() and decide to filter the attributes based on the actual type of the dentry we'll use. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:40:00 +01:00
Miklos Szeredi	a105d685a8	ovl: fix remove/copy-up race ovl_remove_and_whiteout() needs to check if upper dentry exists or not after having locked upper parent directory. Previously we used a "type" value computed before locking the upper parent directory, which is susceptible to racing with copy-up. There's a similar check in ovl_check_empty_and_clear(). This one is not actually racy, since copy-up doesn't change the "emptyness" property of a directory. Add a comment to this effect, and check the existence of upper dentry locally to make the code cleaner. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-11-20 16:39:59 +01:00
Miklos Szeredi	ef94b1864d	ovl: rename filesystem type to "overlay" Some distributions carry an "old" format of overlayfs while mainline has a "new" format. The distros will possibly want to keep the old overlayfs alongside the new for compatibility reasons. To make it possible to differentiate the two versions change the name of the new one from "overlayfs" to "overlay". Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Andy Whitcroft <apw@canonical.com>	2014-11-20 16:39:59 +01:00
Masanari Iida	6774def642	treewide: fix typo in printk and Kconfig This patch fix spelling typo in printk and Kconfig within various part of kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-11-20 14:56:11 +01:00
Al Viro	ec7d879c45	GFS2: gfs2_atomic_open(): simplify the use of finish_no_open() In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL) is equivalent to dget(dentry); return finish_no_open(file, dentry); No need to open-code that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-20 11:18:08 +00:00
Al Viro	9265f1d0c7	GFS2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here... vfree() is allowed under spinlock these days, but it's cheaper when it doesn't step into deferred case and here it's very easy to avoid. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-20 10:29:44 +00:00
Al Viro	3cdcf63ed2	GFS2: use kvfree() instead of open-coding it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-20 10:29:14 +00:00
Al Viro	44bb31bac5	GFS2: gfs2_create_inode(): don't bother with d_splice_alias() dentry is always hashed and negative, inode - non-error, non-NULL and non-directory. In such conditions d_splice_alias() is equivalent to "d_instantiate(dentry, inode) and return NULL", which simplifies the downstream code and is consistent with the "have to create a new object" case. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-20 10:17:39 +00:00
Al Viro	571a4b5797	GFS2: bugger off early if O_CREAT open finds a directory Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-20 10:16:58 +00:00
Jaegeuk Kim	857dc4e059	f2fs: write SSA pages under memory pressure Under memory pressure, we don't need to skip SSA page writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-19 22:49:33 -08:00
Jaegeuk Kim	27c6bd60ac	f2fs: submit bio for node blocks in the reclaim path If a node page is request to be written during the reclaiming path, we should submit the bio to avoid pending to recliam it. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-19 22:49:32 -08:00
Chao Yu	67298804f3	f2fs: introduce struct inode_management to wrap inner fields Now in f2fs, we have three inode cache: ORPHAN_INO, APPEND_INO, UPDATE_INO, and we manage fields related to inode cache separately in struct f2fs_sb_info for each inode cache type. This makes codes a bit messy, so that this patch intorduce a new struct inode_management to wrap inner fields as following which make codes more neat. /* for inner inode cache management / struct inode_management { struct radix_tree_root ino_root; / ino entry array / spinlock_t ino_lock; / for ino entry lock / struct list_head ino_list; / inode list head / unsigned long ino_num; / number of entries / }; struct f2fs_sb_info { ... struct inode_management im[MAX_INO_ENTRY]; / manage inode cache */ ... } Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-19 22:49:32 -08:00
Chao Yu	aba291b3d8	f2fs: remove unneeded check code with option in f2fs_remount Because we have checked the contrary condition in case of "if" judgment, we do not need to check the condition again in case of "else" judgment. Let's remove it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-19 22:49:31 -08:00
Chao Yu	6c02993203	f2fs: avoid unable to restart gc thread in remount In f2fs_remount, we will stop gc thread and set need_restart_gc as true when new option is set without BG_GC, then if any error occurred in the following procedure, we can restore to start the gc thread. But after that, We will fail to restore gc thread in start_gc_thread as BG_GC is not set in new option, so we'd better move this condition judgment out of start_gc_thread to fix this issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-19 22:49:30 -08:00
Markus Elfring	fdf2657bc8	udf: One function call less in udf_fill_super() after error detection The iput() function was called in up to three cases by the udf_fill_super() function during error handling even if the passed data structure element contained still a null pointer. This implementation detail could be improved by the introduction of another jump label. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-19 21:56:06 +01:00
Markus Elfring	0d454e4a44	udf: Deletion of unnecessary checks before the function call "iput" The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-19 21:55:45 +01:00
David Teigland	2ab4bd8ea3	dlm: adopt orphan locks A process may exit, leaving an orphan lock in the lockspace. This adds the capability for another process to acquire the orphan lock. Acquiring the orphan just moves the lock from the orphan list onto the acquiring process's list of locks. An adopting process must specify the resource name and mode of the lock it wants to adopt. If a matching lock is found, the lock is moved to the caller's 's list of locks, and the lkid of the lock is returned like the lkid of a new lock. If an orphan with a different mode is found, then -EAGAIN is returned. If no orphan lock is found on the resource, then -ENOENT is returned. No async completion is used because the result is immediately available. Also, when orphans are purged, allow a zero nodeid to refer to the local nodeid so the caller does not need to look up the local nodeid. Signed-off-by: David Teigland <teigland@redhat.com>	2014-11-19 14:48:02 -06:00
Trond Myklebust	c6c15e1ed3	nfsd: Fix slot wake up race in the nfsv4.1 callback code The currect code for nfsd41_cb_get_slot() and nfsd4_cb_done() has no locking in order to guarantee atomicity, and so allows for races of the form. Task 1 Task 2 ====== ====== if (test_and_set_bit(0) != 0) { clear_bit(0) rpc_wake_up_next(queue) rpc_sleep_on(queue) return false; } This patch breaks the race condition by adding a retest of the bit after the call to rpc_sleep_on(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-19 15:45:44 -05:00
Chris Mason	f82c458a2c	btrfs: fix lockups from btrfs_clear_path_blocking The fair reader/writer locks mean that btrfs_clear_path_blocking needs to strictly follow lock ordering rules even when we already have blocking locks on a given path. Before we can clear a blocking lock on the path, we need to make sure all of the locks have been converted to blocking. This will remove lock inversions against anyone spinning in write_lock() against the buffers we're trying to get read locks on. These inversions didn't exist before the fair read/writer locks, but now we need to be more careful. We papered over this deadlock in the past by changing btrfs_try_read_lock() to be a true trylock against both the spinlock and the blocking lock. This was slower, and not sufficient to fix all the deadlocks. This patch adds a btrfs_tree_read_lock_atomic(), which basically means get the spinlock but trylock on the blocking lock. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: Patrick Schmid <schmid@phys.ethz.ch> cc: stable@vger.kernel.org #v3.15+	2014-11-19 10:34:35 -08:00
Arnd Bergmann	7ca2f23440	isofs: avoid unused function warning With the isofs_hash() function removed, isofs_hash_ms() is the only user of isofs_hash_common(), but it's defined inside of an #ifdef, which triggers this gcc warning in ARM axm55xx_defconfig starting with v3.18-rc3: fs/isofs/inode.c:177:1: warning: 'isofs_hash_common' defined but not used [-Wunused-function] This patch moves the function inside of the same #ifdef section to avoid that warning, which seems the best compromise of a relatively harmless patch for a late -rc. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `b0afd8e5db` ("isofs: don't bother with ->d_op for normal case") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:09:37 -05:00
Yan, Zheng	4a7795d35e	vfs: fix reference leak in d_prune_aliases() In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old dget + d_drop + dput has been replaced with lock_parent + __dentry_kill; unfortunately, dput() does more than just killing dentry - it also drops the reference to parent. New variant leaks that reference and needs dput(parent) after killing the child off. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:07:20 -05:00
Al Viro	8ce74dd605	Merge tag 'trace-seq-file-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into for-next Pull the beginning of seq_file cleanup from Steven: "I'm looking to clean up the seq_file code and to eventually merge the trace_seq code with seq_file as well, since they basically do the same thing. Part of this process is to remove the return code of seq_printf() and friends as they are rather inconsistent. It is better to use the new function seq_has_overflowed() if you want to stop processing when the buffer is full. Note, if the buffer is full, the seq_file code will throw away the contents, allocate a bigger buffer, and then call your code again to fill in the data. The only thing that breaking out of the function early does is to save a little time which is probably never noticed. I started with patches from Joe Perches and modified them as well. There's many more places that need to be updated before we can convert seq_printf() and friends to return void. But this patch set introduces the seq_has_overflowed() and does some initial updates."	2014-11-19 13:02:53 -05:00
Mikulas Patocka	08d4f77222	dcache: fix kmemcheck warning in switch_names This patch fixes kmemcheck warning in switch_names. The function switch_names swaps inline names of two dentries. It swaps full arrays d_iname, no matter how many bytes are really used by the strings. Reading data beyond string ends results in kmemcheck warning. We fix the bug by marking both arrays as fully initialized. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v3.15 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:26 -05:00
Al Viro	9f45f5bf30	new helper: audit_file() ... for situations when we don't have any candidate in pathnames - basically, in descriptor-based syscalls. [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:26 -05:00
Al Viro	6f4e0d5aaa	nfsd_vfs_write(): use file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:26 -05:00
Al Viro	a67f797db6	ncpfs: use file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:25 -05:00
Al Viro	b583043e99	kill f_dentry uses Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:25 -05:00
Al Viro	30e46aba8f	lockd: get rid of ->f_path.dentry->d_sb Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:24 -05:00
Al Viro	3aa3377fbc	procfs: get rid of ->f_dentry Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:24 -05:00
Al Viro	ef8a1a10e9	nfsd: get rid of ->f_dentry Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:23 -05:00
Al Viro	32a59234ae	rpc_pipefs.c: get rid of f_dentry Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:23 -05:00
Al Viro	3c981bfc57	afs_fsync: don't bother with ->f_path.dentry Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:22 -05:00
Al Viro	7119e220a7	cifs: get rid of ->f_path.dentry->d_sb uses, add a new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:22 -05:00
Al Viro	ddb52f4fd2	btrfs: get rid of f_dentry use Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:21 -05:00
Al Viro	244c7d444b	nfsd/nfsctl.c: new helper ... to get from opened file on nfsctl to relevant struct net * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:21 -05:00
Al Viro	a455589f18	assorted conversions to %p[dD] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:20 -05:00
Al Viro	41d28bca2d	switch d_materialise_unique() users to d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:20 -05:00
Al Viro	b5ae6b15bd	merge d_materialise_unique() into d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:01:19 -05:00
Al Viro	154e80e4c3	Merge branch 'for-gfs2' into for-next	2014-11-19 13:00:57 -05:00
Al Viro	427c77d461	d_add_ci() should just accept a hashed exact match if it finds one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 13:00:10 -05:00
Al Viro	845409b49b	gfs2_atomic_open(): simplify the use of finish_no_open() In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL) is equivalent to dget(dentry); return finish_no_open(file, dentry); No need to open-code that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 12:57:21 -05:00
Al Viro	81295ce635	gfs2_create_inode(): don't bother with d_splice_alias() dentry is always hashed and negative, inode - non-error, non-NULL and non-directory. In such conditions d_splice_alias() is equivalent to "d_instantiate(dentry, inode) and return NULL", which simplifies the downstream code and is consistent with the "have to create a new object" case. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 12:57:21 -05:00
Al Viro	986cdb862e	gfs2: bugger off early if O_CREAT open finds a directory Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-19 12:57:14 -05:00
J. Bruce Fields	56429e9b3b	merge nfs bugfixes into nfsd for-3.19 branch In addition to nfsd bugfixes, there are some fixes in -rc5 for client bugs that can interfere with my testing.	2014-11-19 12:06:30 -05:00
Christoph Hellwig	6d0ba0432a	nfsd: correctly define v4.2 support attributes Even when security labels are disabled we support at least the same attributes as v4.1. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-19 12:03:19 -05:00
James Morris	a6aacbde40	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into next	2014-11-19 21:36:07 +11:00
James Morris	b10778a00d	Merge commit 'v3.17' into next	2014-11-19 21:32:12 +11:00
Jaegeuk Kim	8cdcb71322	f2fs: put the inode page when error was occurred We should put the inode page when error was occurred. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-18 17:04:33 -08:00
Jaegeuk Kim	6d20aff83c	f2fs: fix to call put_page at the error handling routine The locked page should be released before returning the function. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-18 17:02:47 -08:00
Markus Elfring	30badc9543	GFS2: Deletion of unnecessary checks before two function calls The functions iput() and put_pid() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-18 10:57:58 +00:00
Markus Elfring	11cc9f56a1	jbd: Deletion of an unnecessary check before the function call "iput" The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-18 10:15:29 +01:00
Dmitry Kasatkin	6fb5032ebb	VFS: refactor vfs_read() integrity_kernel_read() duplicates the file read operations code in vfs_read(). This patch refactors vfs_read() code creating a helper function __vfs_read(). It is used by both vfs_read() and integrity_kernel_read(). Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>	2014-11-17 23:14:22 -05:00
Dave Hansen	abe1e395f6	fs: Do not include mpx.h in exec.c We no longer need mpx.h in exec.c. This will obviously also break the build for non-x86 builds. We get the MPX includes that we need from mmu_context.h now. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2014-11-18 02:01:40 +01:00
Dave Hansen	fe3d197f84	x86, mpx: On-demand kernel allocation of bounds tables This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables need to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly require anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this could be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are HUGE. An X-GB virtual area needs 4X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2014-11-18 00:58:53 +01:00
Qiaowei Ren	4aae7e436f	x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific MPX-enabled applications using large swaths of memory can potentially have large numbers of bounds tables in process address space to save bounds information. These tables can take up huge swaths of memory (as much as 80% of the memory on the system) even if we clean them up aggressively. In the worst-case scenario, the tables can be 4x the size of the data structure being tracked. IOW, a 1-page structure can require 4 bounds-table pages. Being this huge, our expectation is that folks using MPX are going to be keen on figuring out how much memory is being dedicated to it. So we need a way to track memory use for MPX. If we want to specifically track MPX VMAs we need to be able to distinguish them from normal VMAs, and keep them from getting merged with normal VMAs. A new VM_ flag set only on MPX VMAs does both of those things. With this flag, MPX bounds-table VMAs can be distinguished from other VMAs, and userspace can also walk /proc/$pid/smaps to get memory usage for MPX. In addition to this flag, we also introduce a special ->vm_ops specific to MPX VMAs (see the patch "add MPX specific mmap interface"), but currently different ->vm_ops do not by themselves prevent VMA merging, so we still need this flag. We understand that VM_ flags are scarce and are open to other options. Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2014-11-18 00:58:53 +01:00
Benjamin Marzinski	2e60d7683c	GFS2: update freeze code to use freeze/thaw_super on all nodes The current gfs2 freezing code is considerably more complicated than it should be because it doesn't use the vfs freezing code on any node except the one that begins the freeze. This is because it needs to acquire a cluster glock before calling the vfs code to prevent a deadlock, and without the new freeze_super and thaw_super hooks, that was impossible. To deal with the issue, gfs2 had to do some hacky locking tricks to make sure that a frozen node couldn't be holding on a lock it needed to do the unfreeze ioctl. This patch makes use of the new hooks to simply the gfs2 locking code. Now, all the nodes in the cluster freeze and thaw in exactly the same way. Every node in the cluster caches the freeze glock in the shared state. The new freeze_super hook allows the freezing node to grab this freeze glock in the exclusive state without first calling the vfs freeze_super function. All the nodes in the cluster see this lock change, and call the vfs freeze_super function. The vfs locking code guarantees that the nodes can't get stuck holding the glocks necessary to unfreeze the system. To unfreeze, the freezing node uses the new thaw_super hook to drop the freeze glock. Again, all the nodes notice this, reacquire the glock in shared mode and call the vfs thaw_super function. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-17 10:36:39 +00:00
Benjamin Marzinski	48b6bca6b7	fs: add freeze_super/thaw_super fs hooks Currently, freezing a filesystem involves calling freeze_super, which locks sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it hard for gfs2 (and potentially other cluster filesystems) to use the vfs freezing code to do freezes on all the cluster nodes. In order to communicate that a freeze has been requested, and to make sure that only one node is trying to freeze at a time, gfs2 uses a glock (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire this lock before calling freeze_super. This means that two nodes can attempt to freeze the filesystem by both calling freeze_super, acquiring the sb->s_umount lock, and then attempting to grab the cluster glock sd_freeze_gl. Only one will succeed, and the other will be stuck in freeze_super, making it impossible to finish freezing the node. To solve this problem, this patch adds the freeze_super and thaw_super hooks. If a filesystem implements these hooks, they are called instead of the vfs freeze_super and thaw_super functions. This means that every filesystem that implements these hooks must call the vfs freeze_super and thaw_super functions itself within the hook function to make use of the vfs freezing code. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-17 10:35:17 +00:00
Ingo Molnar	e9ac5f0fa8	Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-11-16 10:50:25 +01:00
Ingo Molnar	595247f61f	* Support module unload for efivarfs - Mathias Krause * Another attempt at moving x86 to libstub taking advantage of the __pure attribute - Ard Biesheuvel * Add EFI runtime services section to ptdump - Mathias Krause -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUZnQGAAoJEC84WcCNIz1VfyUP/1MCVt4vepl7+0JzdUP/eVPs CwPM6gBOvgx1PviWrtvSU8UyjtYDqZx7jnCvyvbmlgixAqqIoFm80x5sd9DfyJBj vSrmavXaJgQomJN3N+fvaIpGJXp8NQmeNT87++UMb6VE5nYvx7suDcwfTqOaxcYt yDwKatTXTvQxDLlGgtymp2UhgVKBICs9WVo8weevB5LPmpt4TFCi1GDSimJkfg+0 JTvkKF+QxPmVqgwY7bgdFFcfhsYCux5VgtbD4DJKS3LgfLJLAMKPOt83DbeOSmIa 8zqtlF3eMwHgecKrMLSfIZH3XIl2gsIdPdvT6iBQkwwGZuSzG93JgwJ90HYiKoDm yNlffnhmdgI2RXO97UZJpzqGor+eNc0auuS4485PcE8NtZ1tbo20A/OCpfIzK8j8 Vk7sfZxaaHKF5PUtRe6vo3myRlUCofMIuSEWSF8d709R6AEEuia6RZ3Y45EJPROn fKOiLsf7Og1Mk43Iy2lb7kFT766OsUnZZHU/xiIZj/v94HPWFWoFPtxPC0IURvPx 24oiJxCnyXWGtoyn+SSprl+NAPuPsxVFYriTwaq2RBuoY0NAdy7NIXKe2HTp/WI0 oSTtYkCRcwiHv0aSrg+yQHmwH7y7m39S3yIS4t5LXenn2G4ObUUjhcAQdF2Ft0Pr MT/l+stTt390cpfpcQbE =he2O -----END PGP SIGNATURE----- Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi Pull EFI updates for v3.19 from Matt Fleming: - Support module unload for efivarfs - Mathias Krause - Another attempt at moving x86 to libstub taking advantage of the __pure attribute - Ard Biesheuvel - Add EFI runtime services section to ptdump - Mathias Krause Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-11-16 10:48:53 +01:00
Linus Torvalds	1afcb6ed0d	NFS client bugfixes for Linux 3.18 Highlights include: - Stable patches to fix NFSv4.x delegation reclaim error paths - Fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features - Fix a use-after-free problem with pNFS block layouts - Fix a memory leak in the pNFS files O_DIRECT code - Replace an intrusive and Oops-prone performance fix in the NFSv4 atomic open code with a safer one-line version and revert the two original patches. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUZol9AAoJEGcL54qWCgDyNQQQALnngvpPR51BoO/iTz9ruXol fGZy0SRIlTUKm1ArsQsQ+HGbV5K0hgP3Tg+z2AtEEZ8u/2Fi2Bqdl6+eNY12tKHd uUctDdM5TXLrETAn1UULrnd2eX1cvPMBfOlXlAdNHHsGEgC7w7YQ+rzGwnls+HDy LYXzY7Y3jYGdTMaRgZc5YRdtd8JBpCxciRvPEQLDIobwP0JnZC1afTLe1XInqB2I TZ4NTHT+DEWA+Ou1P2deL7+RuJNEAeWWBvULJy76n4BqKvN/HNedOO5HyBYXrwSd 3UX3wbx9CWRxN1F0EqNKxjxZ/597JwqBeNoTDRcofLsqumUfAOtlbym1EahcD3Ls pykopNfgUhGuhxolStmuHdS6CnyQPERpR5lFZcDp7XtcwSq4FcwD8DRzLJMZW5dg N1lkfFlwQN3rqdk/NEHL+IxS41Hlk4HXjMoP6MNbRtqzIN6tW9tvC4MtAWd1aYxO YuUW281pbWxXQ731s0kTIrMUdQ9vGSRBMcbnO9rL3o+xkh8y5SPVkx9lhdhJN0UD VbQ5Ws/xZ54bD1PfyYb+Yx659lI8MSFOsDuMuLmDtfYnVicHwCA3H63StvQ3ihf/ q0gu8Iex9YbNNjf7IfYGuWPmPn3gwPBoURPC0bcZvMPdY6DXodU6Oj4BRTQ5VCie 9N0pt2wp2eRjaSzD7r5A =8YN6 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - stable patches to fix NFSv4.x delegation reclaim error paths - fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features - fix a use-after-free problem with pNFS block layouts - fix a memory leak in the pNFS files O_DIRECT code - replace an intrusive and Oops-prone performance fix in the NFSv4 atomic open code with a safer one-line version and revert the two original patches" * tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor NFS: Don't try to reclaim delegation open state if recovery failed NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired NFS: SEEK is an NFS v4.2 feature nfs: Fix use of uninitialized variable in nfs_getattr() nfs: Remove bogus assignment nfs: remove spurious WARN_ON_ONCE in write path pnfs/blocklayout: serialize GETDEVICEINFO calls nfs: fix pnfs direct write memory leak Revert "NFS: nfs4_do_open should add negative results to the dcache." Revert "NFS: remove BUG possibility in nfs4_open_and_get_state" NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT	2014-11-15 14:15:16 -08:00
Andrew Price	98f1a696a1	GFS2: Update timestamps on fallocate gfs2_fallocate() wasn't updating ctime and mtime when modifying the inode. Add a call to file_update_time() to do that. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-14 14:16:33 +00:00
Andrew Price	1885867b84	GFS2: Update i_size properly on fallocate This addresses an issue caught by fsx where the inode size was not being updated to the expected value after fallocate(2) with mode 0. The problem was caused by the offset and len parameters being converted to multiples of the file system's block size, so i_size would be rounded up to the nearest block size multiple instead of the requested size. This replaces the per-chunk i_size updates with a single i_size_write on successful completion of the operation. With this patch gfs2 gets through a complete run of fsx. For clarity, the check for (error == 0) following the loop is removed as all failures before that point jump to out_* labels or return. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-14 14:15:04 +00:00
Andrew Price	9c9f1159a5	GFS2: Use inode_newsize_ok and get_write_access in fallocate gfs2_fallocate wasn't checking inode_newsize_ok nor get_write_access. Split out the context setup and inode locking pieces into a separate function to make it more clear and add these missing calls. inode_newsize_ok is called conditional on FALLOC_FL_KEEP_SIZE as there is no need to enforce a file size limit if it isn't going to change. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-14 14:14:30 +00:00
Linus Torvalds	971ad4e4d6	Merge branch 'akpm' (fixes from Andrew Morton) Merge misc fixes from Andrew Morton: "15 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: MAINTAINERS: add IIO include files kernel/panic.c: update comments for print_tainted mem-hotplug: reset node present pages when hot-adding a new pgdat mem-hotplug: reset node managed pages when hot-adding a new pgdat mm/debug-pagealloc: correct freepage accounting and order resetting fanotify: fix notification of groups with inode & mount marks mm, compaction: prevent infinite loop in compact_zone mm: alloc_contig_range: demote pages busy message from warn to info mm/slab: fix unalignment problem on Malta with EVA due to slab merge mm/page_alloc: restrict max order of merging on isolated pageblock mm/page_alloc: move freepage counting logic to __free_one_page() mm/page_alloc: add freepage on isolate pageblock to correct buddy list mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype mm/compaction: skip the range until proper target pageblock is met zram: avoid kunmap_atomic() of a NULL pointer	2014-11-13 16:57:25 -08:00
Jan Kara	8edc6e1688	fanotify: fix notification of groups with inode & mount marks fsnotify() needs to merge inode and mount marks lists when notifying groups about events so that ignore masks from inode marks are reflected in mount mark notifications and groups are notified in proper order (according to priorities). Currently the sorting of the lists done by fsnotify_add_inode_mark() / fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted ignore masks not being used in some cases. Fix the problem by always using the same comparison function when sorting / merging the mark lists. Thanks to Heinrich Schuchardt for improvements of my patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721 Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-11-13 16:17:06 -08:00
Yan, Zheng	3231300bb9	ceph: fix flush tid comparision TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid is only 16 bits. 16 bits should be plenty to let the cap flush updates pipeline appropriately, but we need to cast in the proper direction when comparing these differently-sized versions. So downcast the 64-bits one to 16 bits. Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-11-13 22:19:05 +03:00
Trond Myklebust	f8ebf7a8ca	NFS: Don't try to reclaim delegation open state if recovery failed If state recovery failed, then we should not attempt to reclaim delegated state. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 17:19:04 -05:00
Trond Myklebust	c606bb8857	NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked NFSv4.x (x>0) requires us to call TEST_STATEID+FREE_STATEID if a stateid is revoked. We will currently fail to do this if the stateid is a delegation. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 17:19:04 -05:00
Trond Myklebust	869f9dfa4d	NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return Any attempt to call nfs_remove_bad_delegation() while a delegation is being returned is currently a no-op. This means that we can end up looping forever in nfs_end_delegation_return() if something causes the delegation to be revoked. This patch adds a mechanism whereby the state recovery code can communicate to the delegation return code that the delegation is no longer valid and that it should not be used when reclaiming state. It also changes the return value for nfs4_handle_delegation_recall_error() to ensure that nfs_end_delegation_return() does not reattempt the lock reclaim before state recovery is done. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 17:19:04 -05:00
Trond Myklebust	0c116cadd9	NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE This patch removes the assumption made previously, that we only need to check the delegation stateid when it matches the stateid on a cached open. If we believe that we hold a delegation for this file, then we must assume that its stateid may have been revoked or expired too. If we don't test it then our state recovery process may end up caching open/lock state in a situation where it should not. We therefore rename the function nfs41_clear_delegation_stateid as nfs41_check_delegation_stateid, and change it to always run through the delegation stateid test and recovery process as outlined in RFC5661. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 17:01:33 -05:00
Trond Myklebust	4dfd4f7af0	NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired NFSv4.0 does not have TEST_STATEID/FREE_STATEID functionality, so unlike NFSv4.1, the recovery procedure when stateids have expired or have been revoked requires us to just forget the delegation. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 17:00:09 -05:00
Anna Schumaker	e983120e92	NFS: SEEK is an NFS v4.2 feature Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling SEEK over v4.1. This is wrong, and can make servers crash. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Tested-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:54 -05:00
Jan Kara	16caf5b610	nfs: Fix use of uninitialized variable in nfs_getattr() Variable 'err' needn't be initialized when nfs_getattr() uses it to check whether it should call generic_fillattr() or not. That can result in spurious error returns. Initialize 'err' properly. Signed-off-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:53 -05:00
Jan Kara	b283f94452	nfs: Remove bogus assignment Commit `3a6fd1f004` (pnfs/blocklayout: remove read-modify-write handling in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index in variable initialization. AFAICS it's just a typo so remove it. Spotted by Coverity (id 1248711). CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:53 -05:00
Weston Andros Adamson	16c9914069	nfs: remove spurious WARN_ON_ONCE in write path This WARN_ON_ONCE was supposed to catch reference counting bugs, but can trigger in inappropriate situations. This was reproducible using NFSv2 on an architecture with 64K pages -- we verified that it was not a reference counting bug and the warning was safe to ignore. Reported-by: Will Deacon <will.deacon@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:52 -05:00
Christoph Hellwig	e0d4ed71ca	pnfs/blocklayout: serialize GETDEVICEINFO calls The rpc_pipefs code isn't thread safe, leading to occasional use after frees when running xfstests generic/241 (dbench). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/1411740170-18611-2-git-send-email-hch@lst.de Cc: stable@vger.kernel.org # 3.17.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:52 -05:00
Peng Tao	8c393f9a72	nfs: fix pnfs direct write memory leak For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets. So we need to take care to free them when freeing dreq. Ideally this needs to be done inside layout driver where ds_cinfo.buckets are allocated. But buckets are attached to dreq and reused across LD IO iterations. So I feel it's OK to free them in the generic layer. Cc: stable@vger.kernel.org [v3.4+] Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-12 14:22:51 -05:00
David Sterba	a6f69dc801	btrfs: move commit out of sysfs when changing label Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:15 +01:00
David Sterba	0eae2747ec	btrfs: move commit out of sysfs when changing features Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:14 +01:00
David Sterba	d51033d055	btrfs: introduce pending action: commit In some contexts, like in sysfs handlers, we don't want to trigger a transaction commit. It's a heavy operation, we don't know what external locks may be taken. Instead, make it possible to finish the operation through sync syscall or SYNC_FS ioctl. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:14 +01:00
David Sterba	7e1876aca8	btrfs: switch inode_cache option handling to pending changes The pending mount option(s) now share namespace and bits with the normal options, and the existing one for (inode_cache) is unset unconditionally at each transaction commit. Introduce a separate namespace for pending changes and enhance the descriptions of the intended change to use separate bits for each action. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:13 +01:00
David Sterba	6b5fe46dfa	btrfs: do commit in sync_fs if there are pending changes If a pending change is requested, it's not processed unless there is a transaction commit about to happen, not even after sync or SYNC_FS ioctl. For example a remount that toggles the inode_cache option will not take effect after sync on a quiescent filesystem. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:13 +01:00
David Sterba	572d9ab784	btrfs: add support for processing pending changes There are some actions that modify global filesystem state but cannot be performed at the time of request, but later at the transaction commit time when the filesystem is in a known state. For example enabling new incompat features on-the-fly or issuing transaction commit from unsafe contexts (sysfs handlers). Signed-off-by: David Sterba <dsterba@suse.cz>	2014-11-12 16:53:12 +01:00
Mathias Krause	af5a29aee4	efivarfs: Allow unloading when build as module There is no need to keep the module loaded when it serves no function in case the EFI runtime services are disabled. Return an error in this case so loading the module will fail. Also supply a module_exit function to allow unloading the module. Last, but not least, set the owner of the file_system_type struct. Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2014-11-11 22:22:27 +00:00
Jaegeuk Kim	92dffd0179	f2fs: convert inline_data when i_size becomes large If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode. Otherwise, we can make some dirty pages during the truncation, and those pages will be written through f2fs_write_data_page. At that moment, the inode has still inline_data, so that it tries to write non- zero pages into inline_data area. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-11 14:16:12 -08:00
Jaegeuk Kim	764d2c8040	f2fs: fix deadlock to grab 0'th data page The scenario is like this. One trhead triggers: f2fs_write_data_pages lock_page f2fs_write_data_page f2fs_lock_op <- wait The other thread triggers: f2fs_truncate truncate_blocks f2fs_lock_op truncate_partial_data_page lock_page <- wait for locking the page This patch resolves this bug by relocating truncate_partial_data_page. This function is just to truncate user data page and not related to FS consistency as well. And, we don't need to call truncate_inline_data. Rather than that, f2fs_write_data_page will finally update inline_data later. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-11 14:15:48 -08:00
Jaegeuk Kim	57e2a2c0a6	f2fs: reduce the number of inline_data inode before clearing it The # of inline_data inode is decreased only when it has inline_data. After clearing the flag, we can't decreased the number. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-10 16:29:14 -08:00
Jaegeuk Kim	b7e1d80003	f2fs: implement -o dirsync If a mount option has dirsync, we should call checkpoint for all the directory operations. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-10 06:51:39 -08:00
Jaegeuk Kim	510184c89f	f2fs: do not skip any writes under memory pressure Under memory pressure, let's avoid skipping data writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-10 06:51:38 -08:00
Jaegeuk Kim	2f97c326bf	f2fs: write node pages if checkpoint is not doing It needs to write node pages if checkpoint is not doing in order to avoid memory pressure. Reviewed-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-10 06:51:28 -08:00
Jan Kara	75cbe701a4	vfs: Remove i_dquot field from inode All filesystems using VFS quotas are now converted to use their private i_dquot fields. Remove the i_dquot field from generic inode structure. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:18 +01:00
Jan Kara	507e1fa697	jfs: Convert to private i_dquot field Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> CC: jfs-discussion@lists.sourceforge.net Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:18 +01:00
Jan Kara	53873638bd	reiserfs: Convert to private i_dquot field CC: reiserfs-devel@vger.kernel.org CC: Jeff Mahoney <jeffm@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:17 +01:00
Jan Kara	1c92ec678f	ocfs2: Convert to private i_dquot field CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:11 +01:00
Jan Kara	96c7e0d964	ext4: Convert to private i_dquot field CC: linux-ext4@vger.kernel.org Acked-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:11 +01:00
Jan Kara	4018cfbc8c	ext3: Convert to private i_dquot field CC: linux-ext4@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:10 +01:00
Jan Kara	64241118b7	ext2: Convert to private i_dquot field CC: linux-ext4@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:10 +01:00
Jan Kara	2d0fa46791	quota: Use function to provide i_dquot pointers i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs, reiserfs) so it is beneficial to move this array to fs-private part of the inode. We cannot just pass quota pointers from filesystems to quota functions because during quotaon and quotaoff we have to traverse list of all inodes and manipulate i_dquot pointers for each inode. So we provide a function which generic quota code can use to get pointer to the i_dquot array from the filesystem. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:09 +01:00
Jan Kara	17ef4fdd37	xfs: Set allowed quota types We support user, group, and project quotas. Tell VFS about it. CC: xfs@oss.sgi.com CC: Dave Chinner <david@fromorbit.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:09 +01:00
Jan Kara	de3b08d3ec	gfs2: Set allowed quota types We support user and group quotas. Tell vfs about it. Acked-by: Steven Whitehouse <swhiteho@redhat.com> CC: cluster-devel@redhat.com Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:08 +01:00
Jan Kara	2c5f648aa2	quota: Allow each filesystem to specify which quota types it supports Currently all filesystems supporting VFS quota support user and group quotas. With introduction of project quotas this is going to change so make sure filesystem isn't called for quota type it doesn't support by introduction of a bitmask determining which quota types each filesystem supports. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:08 +01:00
Jan Kara	6bab3596bb	quota: Remove const from function declarations We don't use const through VFS too much so just remove it from quota function declarations. Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-10 10:06:07 +01:00
Linus Torvalds	c4c23fb6f2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "It's a one liner for an error cleanup path that leads to crashes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup	2014-11-09 14:30:24 -08:00
Linus Torvalds	661b99e95f	xfs: fixes for v3.18-rc3 This update fixes: - incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking - another zero-range bug fix for stray file size updates - a bunch of fixes for regression in the bulkstat code introduced in 3.17. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJUXTquAAoJEK3oKUf0dfodzPQP+wVWm3suRpvfpOeljlwcCB1w r1MjJjEgKRmlRCwTe4IYSSS2ZBSz3f5qTunde1PEAcUyyf2gO5b/gV2WCaVDWIpV 0/1RDaXIRplTvY/i5UAOtqOSUpNwvWz7PCmCAR7RCHFfTyBBbFRlRdg1GPCrBAzv rf/kRm9C6fRHHwLojwNCLEzA0MAdjKVG05A+Xv3MnWcd9fRxtNryQnhTIUJoDCVl 5keebmizquoc88WoQxDX29j2Ce+yjMPj27YhB9Z09mfmFvbLHT46UP2jI5Ty+DX5 rJikXA5Jv6wtig9wsbsenK7fsy1CyAAbmxS8vjyHsdCSAMpN98HkndsSadF8nH5U sEh43OJjJS5LedVNqfV+LtI9ZD9+fGpAETQFVI8TpefFBX2aYw3fomWO0AUbRuMi s4f6iz2sDvgDa+oE38XqKif6CqFsBX0QQSuCDiPaMi79vy2VLE/I5SU7QAj42+BW sEyGVVcdJDsDpkUe6SicIpbNNwhXCR4GYc4jI4QYjDdcK6rrliCnsI8hHk5pPeKk Qvt6ERiP5dLvp9f6KEzvPAkdxBmDKOtHKMSEMJzBBDtxFJXStJNuYZVtMYb8JwJq nV0WKSoXR0xT/IeMw8336HF4GO04VCHjY3QnNh0qNdHYtfXJerZmpzVhae7dJmZZ nLFimb3q2TlfgEvkPqmi =8S0c -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This update fixes a warning in the new pagecache_isize_extended() and updates some related comments, another fix for zero-range misbehaviour, and an unforntuately large set of fixes for regressions in the bulkstat code. The bulkstat fixes are large but necessary. I wouldn't normally push such a rework for a -rcX update, but right now xfsdump can silently create incomplete dumps on 3.17 and it's possible that even xfsrestore won't notice that the dumps were incomplete. Hence we need to get this update into 3.17-stable kernels ASAP. In more detail, the refactoring work I committed in 3.17 has exposed a major hole in our QA coverage. With both xfsdump (the major user of bulkstat) and xfsrestore silently ignoring missing files in the dump/restore process, incomplete dumps were going unnoticed if they were being triggered. Many of the dump/restore filesets were so small that they didn't evenhave a chance of triggering the loop iteration bugs we introduced in 3.17, so we didn't exercise the code sufficiently, either. We have already taken steps to improve QA coverage in xfstests to avoid this happening again, and I've done a lot of manual verification of dump/restore on very large data sets (tens of millions of inodes) of the past week to verify this patch set results in bulkstat behaving the same way as it does on 3.16. Unfortunately, the fixes are not exactly simple - in tracking down the problem historic API warts were discovered (e.g xfsdump has been working around a 20 year old bug in the bulkstat API for the past 10 years) and so that complicated the process of diagnosing and fixing the problems. i.e. we had to fix bugs in the code as well as discover and re-introduce the userspace visible API bugs that we unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly. Summary: - incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking - another zero-range bug fix for stray file size updates - a bunch of fixes for regression in the bulkstat code introduced in 3.17" * tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: track bulkstat progress by agino xfs: bulkstat error handling is broken xfs: bulkstat main loop logic is a mess xfs: bulkstat chunk-formatter has issues xfs: bulkstat chunk formatting cursor is broken xfs: bulkstat btree walk doesn't terminate mm: Fix comment before truncate_setsize() xfs: rework zero range to prevent invalid i_size updates mm: Remove false WARN_ON from pagecache_isize_extended() xfs: Check error during inode btree iteration in xfs_bulkstat() xfs: bulkstat doesn't release AGI buffer on error	2014-11-07 14:08:13 -08:00
Jeff Layton	5b095e9992	nfsd: convert nfs4_file searches to use RCU The global state_lock protects the file_hashtbl, and that has the potential to be a scalability bottleneck. Address this by making the file_hashtbl use RCU. Add a rcu_head to the nfs4_file and use that when freeing ones that have been hashed. In order to conserve space, we union the fi_rcu field with the fi_delegations list_head which must be clear by the time the last reference to the file is dropped. Convert find_file_locked to use RCU lookup primitives and not to require that the state_lock be held, and convert find_file to do a lockless lookup. Convert find_or_add_file to attempt a lockless lookup first, and then fall back to doing a locked search and insert if that fails to find anything. Also, minimize the number of times we need to calculate the hash value by passing it in as an argument to the search and insert functions, and optimize the order of arguments in nfsd4_init_file. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-07 16:56:11 -05:00
Anna Schumaker	b0cb908523	nfsd: Add DEALLOCATE support DEALLOCATE only returns a status value, meaning we can use the noop() xdr encoder to reply to the client. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-07 16:20:15 -05:00
Anna Schumaker	95d871f03c	nfsd: Add ALLOCATE support The ALLOCATE operation is used to preallocate space in a file. I can do this by using vfs_fallocate() to do the actual preallocation. ALLOCATE only returns a status indicator, so we don't need to write a special encode() function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-07 16:19:49 -05:00
Anna Schumaker	72c72bdf7b	VFS: Rename do_fallocate() to vfs_fallocate() This function needs to be exported so it can be used by the NFSD module when responding to the new ALLOCATE and DEALLOCATE operations in NFS v4.2. Christoph Hellwig suggested renaming the function to stay consistent with how other vfs functions are named. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-07 16:17:44 -05:00
NeilBrown	4ef67a8c95	sysfs/kernfs: make read requests on pre-alloc files use the buffer. To match the previous patch which used the pre-alloc buffer for writes, this patch causes reads to use the same buffer. This is not strictly necessary as the current seq_read() will allocate on first read, so user-space can trigger the required pre-alloc. But consistency is valuable. The read function is somewhat simpler than seq_read() and, for example, does not support reading from an offset into the file: reads must be at the start of the file. As seq_read() does not use the prealloc buffer, ->seq_show is incompatible with ->prealloc and caused an EINVAL return from open(). sysfs code which calls into kernfs always chooses the correct function. As the buffer is shared with writes and other reads, the mutex is extended to cover the copy_to_user. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-11-07 10:54:38 -08:00
NeilBrown	2b75869bba	sysfs/kernfs: allow attributes to request write buffer be pre-allocated. md/raid allows metadata management to be performed in user-space. A various times, particularly on device failure, the metadata needs to be updated before further writes can be permitted. This means that the user-space program which updates metadata much not block on writeout, and so must not allocate memory. mlockall(MCL_CURRENT\|MCL_FUTURE) and pre-allocation can avoid all memory allocation issues for user-memory, but that does not help kernel memory. Several kernel objects can be pre-allocated. e.g. files opened before any writes to the array are permitted. However some kernel allocation happens in places that cannot be pre-allocated. In particular, writes to sysfs files (to tell md that it can now allow writes to the array) allocate a buffer using GFP_KERNEL. This patch allows attributes to be marked as "PREALLOC". In that case the maximal buffer is allocated when the file is opened, and then used on each write instead of allocating a new buffer. As the same buffer is now shared for all writes on the same file description, the mutex is extended to cover full use of the buffer including the copy_from_user(). The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is inspected by sysfs_add_file_mode_ns() to determine if the file should be marked as requiring prealloc. Despite the comment, we do use ->seq_show together with ->prealloc in this patch. The next patch fixes that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-11-07 10:53:25 -08:00
Vladimir Zapolskiy	0936896056	fs: sysfs: return EGBIG on write if offset is larger than file size According to the user expectations common utilities like dd or sh redirection operator > should work correctly over binary files from sysfs. At the moment doing excessive write can not be completed: write(1, "\0\0\0\0\0\0\0\0", 8) = 4 write(1, "\0\0\0\0", 4) = 0 write(1, "\0\0\0\0", 4) = 0 write(1, "\0\0\0\0", 4) = 0 ... Fix the problem by returning EFBIG described in man 2 write. Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-11-07 10:52:20 -08:00
Subodh Nijsure	a76284e6f8	UBIFS: fix a couple bugs in UBIFS xattr length calculation The journal update function did not work for extended attributes properly, because extended attribute inodes carry the xattr data, and the size of this data was not taken into account. Artem: improved commit message, amended the patch a bit. Signed-off-by: Subodh Nijsure <snijsure@grid-net.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Shelton <ben.shelton@ni.com> Acked-by: Brad Mouring <brad.mouring@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2014-11-07 12:32:22 +02:00
Artem Bityutskiy	789c89935c	UBIFS: fix budget leak in error path We forgot to free the budget in 'write_begin_slow()' when 'do_readpage()' fails. This patch fixes the issue. Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2014-11-07 12:08:50 +02:00
Jaegeuk Kim	e5e7ea3c86	f2fs: control the memory footprint used by ino entries This patch adds to control the memory footprint used by ino entries. This will conduct best effort, not strictly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-06 15:24:46 -08:00
Jaegeuk Kim	8c402946f0	f2fs: introduce the number of inode entries This patch adds to monitor the number of ino entries. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-06 15:17:43 -08:00
Dave Chinner	0027589926	xfs: track bulkstat progress by agino The bulkstat main loop progress is tracked by the "lastino" variable, which is a full 64 bit inode. However, the loop actually works on agno/agino pairs, and so there's a significant disconnect between the rest of the loop and the main cursor. Convert this to use the agino, and pass the agino into the chunk formatting function and convert it too. This gets rid of the inconsistency in the loop processing, and finally makes it simple for us to skip inodes at any point in the loop simply by incrementing the agino cursor. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-11-07 08:33:52 +11:00
Dave Chinner	febe3cbe38	xfs: bulkstat error handling is broken The error propagation is a horror - xfs_bulkstat() returns a rval variable which is only set if there are formatter errors. Any sort of btree walk error or corruption will cause the bulkstat walk to terminate but will not pass an error back to userspace. Worse is the fact that formatter errors will also be ignored if any inodes were correctly formatted into the user buffer. Hence bulkstat can fail badly yet still report success to userspace. This causes significant issues with xfsdump not dumping everything in the filesystem yet reporting success. It's not until a restore fails that there is any indication that the dump was bad and tha bulkstat failed. This patch now triggers xfsdump to fail with bulkstat errors rather than silently missing files in the dump. This now causes bulkstat to fail when the lastino cookie does not fall inside an existing inode chunk. The pre-3.17 code tolerated that error by allowing the code to move to the next inode chunk as the agino target is guaranteed to fall into the next btree record. With the fixes up to this point in the series, xfsdump now passes on the troublesome filesystem image that exposes all these bugs. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2014-11-07 08:31:15 +11:00
Dave Chinner	6e57c542cb	xfs: bulkstat main loop logic is a mess There are a bunch of variables tha tare more wildy scoped than they need to be, obfuscated user buffer checks and tortured "next inode" tracking. This all needs cleaning up to expose the real issues that need fixing. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-11-07 08:31:13 +11:00
Dave Chinner	2b831ac6bc	xfs: bulkstat chunk-formatter has issues The loop construct has issues: - clustidx is completely unused, so remove it. - the loop tries to be smart by terminating when the "freecount" tells it that all inodes are free. Just drop it as in most cases we have to scan all inodes in the chunk anyway. - move the "user buffer left" condition check to the only point where we consume space int eh user buffer. - move the initialisation of agino out of the loop, leaving just a simple loop control logic using the clusteridx. Also, double handling of the user buffer variables leads to problems tracking the current state - use the cursor variables directly rather than keeping local copies and then having to update the cursor before returning. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-11-07 08:30:58 +11:00
Dave Chinner	bf4a5af20d	xfs: bulkstat chunk formatting cursor is broken The xfs_bulkstat_agichunk formatting cursor takes buffer values from the main loop and passes them via the structure to the chunk formatter, and the writes the changed values back into the main loop local variables. Unfortunately, this complex dance is full of corner cases that aren't handled correctly. The biggest problem is that it is double handling the information in both the main loop and the chunk formatting function, leading to inconsistent updates and endless loops where progress is not made. To fix this, push the struct xfs_bulkstat_agichunk outwards to be the primary holder of user buffer information. this removes the double handling in the main loop. Also, pass the last inode processed by the chunk formatter as a separate parameter as it purely an output variable and is not related to the user buffer consumption cursor. Finally, the chunk formatting code is not shared by anyone, so make it local to xfs_itable.c. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-11-07 08:30:30 +11:00
Dave Chinner	afa947cb52	xfs: bulkstat btree walk doesn't terminate The bulkstat code has several different ways of detecting the end of an AG when doing a walk. They are not consistently detected, and the code that checks for the end of AG conditions is not consistently coded. Hence the are conditions where the walk code can get stuck in an endless loop making no progress and not triggering any termination conditions. Convert all the "tmp/i" status return codes from btree operations to a common name (stat) and apply end-of-ag detection to these operations consistently. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-11-07 08:29:57 +11:00
Jeff Layton	9af94fc4e4	lockd: ratelimit "lockd: cannot monitor" messages When lockd can't talk to a remote statd, it'll spew a warning message to the ring buffer. If the application is really hammering on locks however, it's possible for that message to spam the logs. Ratelimit it to minimize the potential for harm. Reported-by: Ian Collier <imc@cs.ox.ac.uk> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-06 14:47:33 -05:00
Gu Zheng	835f252c6d	aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer https://bugzilla.kernel.org/show_bug.cgi?id=86831 Markus reported that when shutting down mysqld (with AIO support, on a ext3 formatted Harddrive) leads to a negative number of dirty pages (underrun to the counter). The negative number results in a drastic reduction of the write performance because the page cache is not used, because the kernel thinks it is still 2 ^ 32 dirty pages open. Add a warn trace in __dec_zone_state will catch this easily: static inline void __dec_zone_state(struct zone zone, enum zone_stat_item item) { atomic_long_dec(&zone->vm_stat[item]); + WARN_ON_ONCE(item == NR_FILE_DIRTY && atomic_long_read(&zone->vm_stat[item]) < 0); atomic_long_dec(&vm_stat[item]); } [ 21.341632] ------------[ cut here ]------------ [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224() [ 21.355296] Modules linked in: wutbox_cp sata_mv [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80 [ 21.366793] Workqueue: events free_ioctx [ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24) [ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28) [ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c) [ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34) [ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224) [ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158) [ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c) [ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac) [ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74) [ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90) [ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc) [ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44) [ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c) [ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414) [ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec) [ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20) [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]--- The cause is that we set the aio ring file pages as DIRTY* via SetPageDirty (bypasses the VFS dirty pages increment) when init, and aio fs uses default_backing_dev_info as the backing dev, which does not disable the dirty pages accounting capability. So truncating aio ring file will contribute to accounting dirty pages (VFS dirty pages decrement), then error occurs. The original goal is keeping these pages in memory (can not be reclaimed or swapped) in life-time via marking it dirty. But thinking more, we have already pinned pages via elevating the page's refcount, which can already achieve the goal, so the SetPageDirty seems unnecessary. In order to fix the issue, using the __set_page_dirty_no_writeback instead of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually set the dirty flags, don't disable set_page_dirty(), rely on default behaviour). With the above change, the dirty pages accounting can work well. But as we known, aio fs is an anonymous one, which should never cause any real write-back, we can ignore the dirty pages (write back) accounting by disabling the dirty pages (write back) accounting capability. So we introduce an aio private backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to replace the default one. Reported-by: Markus Königshaus <m.koenigshaus@wut.de> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable <stable@vger.kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2014-11-06 14:27:19 -05:00
Jaegeuk Kim	a344b9fda0	f2fs: disable roll-forward when active_logs = 2 The roll-forward mechanism should be activated when the number of active logs is not 2. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-05 20:05:53 -08:00
Al Viro	7e8631e8b9	fix breakage in o2net_send_tcp_msg() uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()" by me ;-/ Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-05 15:21:18 -05:00
Joe Perches	9761536e1d	debugfs: Have debugfs_print_regs32() return void The seq_printf() will soon just return void, and seq_has_overflowed() should be used instead to see if the seq can no longer accept input. As the return value of debugfs_print_regs32() has no users and the seq_file descriptor should be checked with seq_has_overflowed() instead of return values of functions, it is better to just have debugfs_print_regs32() also return void. Link: http://lkml.kernel.org/p/2634b19eb1c04a9d31148c1fe6f1f3819be95349.1412031505.git.joe@perches.com Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Joe Perches <joe@perches.com> [ original change only updated seq_printf() return, added return of void to debugfs_print_regs32() as well ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2014-11-05 14:13:38 -05:00
Joe Perches	a3816ab0e8	fs: Convert show_fdinfo functions to void seq_printf functions shouldn't really check the return value. Checking seq_has_overflowed() occasionally is used instead. Update vfs documentation. Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com Cc: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> [ did a few clean ups ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2014-11-05 14:13:23 -05:00
Joe Perches	f365ef9b79	dlm: Use seq_puts() instead of seq_printf() for constant strings Convert the seq_printf output with constant strings to seq_puts. Link: http://lkml.kernel.org/p/b416b016f4a6e49115ba736cad6ea2709a8bc1c4.1412031505.git.joe@perches.com Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2014-11-05 14:13:09 -05:00
Joe Perches	d6d906b234	dlm: Remove seq_printf() return checks and use seq_has_overflowed() The seq_printf() return is going away soon and users of it should check seq_has_overflowed() to see if the buffer is full and will not accept any more data. Convert functions returning int to void where seq_printf() is used. Link: http://lkml.kernel.org/p/43590057bcb83846acbbcc1fe641f792b2fb7773.1412031505.git.joe@perches.com Link: http://lkml.kernel.org/r/20141029220107.939492048@goodmis.org Acked-by: David Teigland <teigland@redhat.com> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: cluster-devel@redhat.com Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2014-11-05 14:12:38 -05:00
Sebastian Schmidt	68c4a4f8ab	pstore: Honor dmesg_restrict sysctl on dmesg dumps When the kernel.dmesg_restrict restriction is in place, only users with CAP_SYSLOG should be able to access crash dumps (like: attacker is trying to exploit a bug, watchdog reboots, attacker can happily read crash dumps and logs). This puts the restriction on console-* types as well as sensitive information could have been leaked there. Other log types are unaffected. Signed-off-by: Sebastian Schmidt <yath@yath.de> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>	2014-11-05 09:59:48 -08:00
Ben Zhang	a28726b4fb	pstore/ram: Strip ramoops header for correct decompression pstore compression/decompression was added during 3.12. The ramoops driver prepends a "====timestamp.timestamp-C\|D\n" header to the compressed record before handing it over to pstore driver which doesn't know about the header. In pstore_decompress(), the pstore driver reads the first "==" as a zlib header, so the decompression always fails. For example, this causes the driver to write /dev/pstore/dmesg-ramoops-0.enc.z instead of /dev/pstore/dmesg-ramoops-0. This patch makes the ramoops driver remove the header before pstore decompression. Signed-off-by: Ben Zhang <benzh@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>	2014-11-05 09:58:17 -08:00
Dmitry Monakhov	88c6b61ff1	ext4: move_extent improve bh vanishing success factor Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302 Unfortunetly there is nothing we can do if some other task holds BH's refenrence. So we must return EBUSY in this case. But we can try kicking the journal to see if the other task releases the bh reference after the commit is complete. Also decrease false positives by properly checking for ENOSPC and retrying the allocation after kicking the journal --- which is done by ext4_should_retry_alloc(). [ Modified by tytso to properly check for ENOSPC. ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-11-05 11:52:38 -05:00
Miklos Szeredi	3f822c6264	ovl: don't poison cursor ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be rebuilt. We did list_del() on the cursor, which results in an Oops on the poisoned pointer in ovl_seek_cursor(). Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-05 08:49:38 -05:00
Trond Myklebust	dca780016d	Revert "NFS: nfs4_do_open should add negative results to the dcache." This reverts commit `4fa2c54b51`.	2014-11-04 19:53:50 -06:00
Trond Myklebust	7488cbc256	Revert "NFS: remove BUG possibility in nfs4_open_and_get_state" This reverts commit `f39c010479`.	2014-11-04 19:53:49 -06:00
Trond Myklebust	809fd143de	NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT If the OPEN rpc call to the server fails with an ENOENT call, nfs_atomic_open will create a negative dentry for that file, however it currently fails to call nfs_set_verifier(), thus causing the dentry to be immediately revalidated on the next call to nfs_lookup_revalidate() instead of following the usual lookup caching rules. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-04 19:53:49 -06:00
Jaegeuk Kim	d5053a34a9	f2fs: introduce -o fastboot for reducing booting time only If a system wants to reduce the booting time as a top priority, now we can use a mount option, -o fastboot. With this option, f2fs conducts a little bit slow write_checkpoint, but it can avoid the node page reads during the next mount time. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-04 17:34:15 -08:00
Jaegeuk Kim	6a8f8ca582	f2fs: avoid race condition in handling wait_io __submit_merged_bio f2fs_write_end_io f2fs_write_end_io wait_io = X wait_io = x complete(X) complete(X) wait_io = NULL wait_for_completion() free(X) spin_lock(X) kernel panic In order to avoid this, this patch removes the wait_io facility. Instead, we can use wait_on_all_pages_writeback(sbi) to wait for end_ios. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-04 17:34:14 -08:00
Jaegeuk Kim	adf4983bde	f2fs: send discard commands in larger extent If there is a chance to make a huge sized discard command, we don't need to split it out, since each blkdev_issue_discard should wait one at a time. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-04 17:34:13 -08:00
Jaegeuk Kim	b3d208f96d	f2fs: revisit inline_data to avoid data races and potential bugs This patch simplifies the inline_data usage with the following rule. 1. inline_data is set during the file creation. 2. If new data is requested to be written ranges out of inline_data, f2fs converts that inode permanently. 3. There is no cases which converts non-inline_data inode to inline_data. 4. The inline_data flag should be changed under inode page lock. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-04 17:34:11 -08:00
Tejun Heo	9c6ac78eb3	writeback: fix a subtle race condition in I_DIRTY clearing After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and tests inode->i_state locklessly to see whether it already has all the necessary I_DIRTY bits set. The comment above the barrier doesn't contain any useful information - memory barriers can't ensure "changes are seen by all cpus" by itself. And it sure enough was broken. Please consider the following scenario. CPU 0 CPU 1 ------------------------------------------------------------------------------- enters __writeback_single_inode() grabs inode->i_lock tests PAGECACHE_TAG_DIRTY which is clear enters __set_page_dirty() grabs mapping->tree_lock sets PAGECACHE_TAG_DIRTY releases mapping->tree_lock leaves __set_page_dirty() enters __mark_inode_dirty() smp_mb() sees I_DIRTY_PAGES set leaves __mark_inode_dirty() clears I_DIRTY_PAGES releases inode->i_lock Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem to lead to an immediately critical problem because requeue_inode() later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when deciding whether the inode needs to be requeued for IO and there are enough unintentional memory barriers inbetween, so while the inode ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the IO list. The lack of explicit barrier may also theoretically affect the other I_DIRTY bits which deal with metadata dirtiness. There is no guarantee that a strong enough barrier exists between I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied inode. Filesystem inode writeout path likely has enough stuff which can behave as full barrier but it's theoretically possible that the writeout may not see all the updates from ->dirty_inode(). Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note that I_DIRTY_PAGES needs a special treatment as it always needs to be cleared to be interlocked with the lockless test on __mark_inode_dirty() side. It's cleared unconditionally and reinstated after smp_mb() if the mapping still has dirty pages. Also add comments explaining how and why the barriers are paired. Lightly tested. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 10:42:23 -07:00
Chris Mason	6e5aafb274	Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead. This bug came from commit `0678b6185` btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range() Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Erik Berg <btrfs@slipsprogrammoer.no> cc: stable@vger.kernel.org # 3.3 or newer	2014-11-04 06:59:04 -08:00
Anton Blanchard	19858e7bdc	quota: Add log level to printk JK: Added VFS: prefix to the message when changing it to make it more standard. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Jan Kara <jack@suse.cz>	2014-11-04 12:01:06 +01:00
Greg Kroah-Hartman	a8a93c6f99	Merge branch 'platform/remove_owner' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux into driver-core-next Remove all .owner fields from platform drivers	2014-11-03 19:53:56 -08:00
Jan Kara	1f7732fe6c	f2fs: remove pointless bit testing in f2fs_delete_entry() There's no point in using test_and_clear_bit_le() when we don't use the return value of the function. Just use clear_bit_le() instead. Coverity-id: 1016434 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:38 -08:00
Jaegeuk Kim	e3fb1b794b	f2fs: do not discard data protected by the previous checkpoint We should not discard any data protected by the previous checkpoint all the time. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:38 -08:00
Jaegeuk Kim	427a45c8e2	f2fs: flush_dcache_page for inline data When reading inline data, we should call flush_dcache_page. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:37 -08:00
Jaegeuk Kim	ca4b02eeed	f2fs: call write_checkpoint under disabled gc During the write_checkpoint, we should avoid f2fs_gc trigger to avoid any filesystem consistency. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:37 -08:00
Jan Kara	9234f3190b	f2fs: fix possible data corruption in f2fs_write_begin() f2fs_write_begin() doesn't initialize the 'dn' variable if the inode has inline data. However it uses its contents to decide whether it should just zero out the page or load data to it. Thus if we are unlucky we can zero out page contents instead of loading inline data into a page. CC: stable@vger.kernel.org CC: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:37 -08:00
Gu Zheng	2cc2218611	f2fs: use current_sit_addr to replace the open code Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:37 -08:00
Gu Zheng	52aca07425	f2fs: rename f2fs_set/clear_bit to f2fs_test_and_set/clear_bit Rename f2fs_set/clear_bit to f2fs_test_and_set/clear_bit, which mean set/clear bit and return the old value, for better readability. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:36 -08:00
Gu Zheng	1730663cb7	f2fs: set raw_super default to NULL to avoid compile warning Set raw_super default to NULL to avoid the possibly used uninitialized warning, though we may never hit it in fact. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:36 -08:00
Gu Zheng	c6ac4c0ec4	f2fs: introduce f2fs_change_bit to simplify the change bit logic Introduce f2fs_change_bit to simplify the change bit logic in function set_to_next_nat{sit}. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:36 -08:00
Gu Zheng	fa528722d0	f2fs: remove the redundant function cond_clear_inode_flag Use clear_inode_flag to replace the redundant cond_clear_inode_flag. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:36 -08:00
Gu Zheng	8a2d0ace3a	f2fs: remove the seems unneeded argument 'type' from __get_victim Remove the unneeded argument 'type' from __get_victim, use NO_CHECK_TYPE directly when calling v_ops->get_victim(). Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:35 -08:00
Jan Kara	9bd27ae4aa	f2fs: avoid returning uninitialized value to userspace from f2fs_trim_fs() If user specifies too low end sector for trimming, f2fs_trim_fs() will use uninitialized value as a number of trimmed blocks and returns it to userspace. Initialize number of trimmed blocks early to avoid the problem. Coverity-id: 1248809 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:35 -08:00
Jaegeuk Kim	d64948a4df	f2fs: declare f2fs_convert_inline_dir as a static function This patch declares f2fs_convert_inline_dir as a static function, which was reported by kbuild test robot. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:35 -08:00
Jaegeuk Kim	f1e33a041e	f2fs: use kmap_atomic instead of kmap For better performance, we need to use kmap_atomic instead of kmap. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:35 -08:00
Jaegeuk Kim	062a3e7ba7	f2fs: reuse make_empty_dir code for inline_dentry This patch introduces do_make_empty_dir to mitigate code redundancy for inline_dentry. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:34 -08:00
Jaegeuk Kim	7b3cd7d6f0	f2fs: introduce f2fs_dentry_ptr structure for code clean-up This patch introduces f2fs_dentry_ptr structure for the use of a function parameter in inline_dentry operations. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:34 -08:00
Jaegeuk Kim	5ab18570b8	f2fs: should not truncate any inline_dentry If the inode has inline_dentry, we should not truncate any block indices. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:34 -08:00
Jaegeuk Kim	38594de767	f2fs: reuse core function in f2fs_readdir for inline_dentry This patch introduces a core function, f2fs_fill_dentries, to remove redundant code in f2fs_readdir and f2fs_read_inline_dir. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:34 -08:00
Jaegeuk Kim	e7a2bf2283	f2fs: fix counting inline_data inode numbers This patch fixes wrongly counting inline_data inode numbers. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:33 -08:00
Jaegeuk Kim	3289c061c5	f2fs: add stat info for inline_dentry inodes This patch adds status information for inline_dentry inodes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:33 -08:00
Jaegeuk Kim	bce8d11207	f2fs: avoid deadlock on init_inode_metadata Previously, init_inode_metadata does not hold any parent directory's inode page. So, f2fs_init_acl can grab its parent inode page without any problem. But, when we use inline_dentry, that page is grabbed during f2fs_add_link, so that we can fall into deadlock condition like below. INFO: task mknod:11006 blocked for more than 120 seconds. Tainted: G OE 3.17.0-rc1+ #13 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mknod D ffff88003fc94580 0 11006 11004 0x00000000 ffff880007717b10 0000000000000002 ffff88003c323220 ffff880007717fd8 0000000000014580 0000000000014580 ffff88003daecb30 ffff88003c323220 ffff88003fc94e80 ffff88003ffbb4e8 ffff880007717ba0 0000000000000002 Call Trace: [<ffffffff8173dc40>] ? bit_wait+0x50/0x50 [<ffffffff8173d4cd>] io_schedule+0x9d/0x130 [<ffffffff8173dc6c>] bit_wait_io+0x2c/0x50 [<ffffffff8173da3b>] __wait_on_bit_lock+0x4b/0xb0 [<ffffffff811640a7>] __lock_page+0x67/0x70 [<ffffffff810acf50>] ? autoremove_wake_function+0x40/0x40 [<ffffffff811652cc>] pagecache_get_page+0x14c/0x1e0 [<ffffffffa029afa9>] get_node_page+0x59/0x130 [f2fs] [<ffffffffa02a63ad>] read_all_xattrs+0x24d/0x430 [f2fs] [<ffffffffa02a6ca2>] f2fs_getxattr+0x52/0xe0 [f2fs] [<ffffffffa02a7481>] f2fs_get_acl+0x41/0x2d0 [f2fs] [<ffffffff8122d847>] get_acl+0x47/0x70 [<ffffffff8122db5a>] posix_acl_create+0x5a/0x150 [<ffffffffa02a7759>] f2fs_init_acl+0x29/0xcb [f2fs] [<ffffffffa0286a8d>] init_inode_metadata+0x5d/0x340 [f2fs] [<ffffffffa029253a>] f2fs_add_inline_entry+0x12a/0x2e0 [f2fs] [<ffffffffa0286ea5>] __f2fs_add_link+0x45/0x4a0 [f2fs] [<ffffffffa028b5b6>] ? f2fs_new_inode+0x146/0x220 [f2fs] [<ffffffffa028b816>] f2fs_mknod+0x86/0xf0 [f2fs] [<ffffffff811e3ec1>] vfs_mknod+0xe1/0x160 [<ffffffff811e4b26>] SyS_mknod+0x1f6/0x200 [<ffffffff81741d7f>] tracesys+0xe1/0xe6 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:33 -08:00
Jaegeuk Kim	59a0615540	f2fs: fix to wait correct block type The inode page needs to wait NODE block io. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:33 -08:00
Jaegeuk Kim	4e6ebf6d49	f2fs: reuse find_in_block code for find_in_inline_dir This patch removes redundant copied code in find_in_inline_dir. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:32 -08:00
Jaegeuk Kim	a82afa2019	f2fs: reuse room_for_filename for inline dentry operation This patch introduces to reuse the existing room_for_filename for inline dentry operation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:32 -08:00
Chao Yu	622f28ae9b	f2fs: enable inline dir handling Add inline dir functions into normal dir ops' function to handle inline ops. Besides, we enable inline dir mode when a new dir inode is created if inline_data option is on. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:32 -08:00
Chao Yu	201a05be96	f2fs: add key function to handle inline dir Adds Functions to implement inline dir init/lookup/insert/delete/convert ops. Signed-off-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: remove needless reserved area copy, pointed by Dan Carpenter] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:31 -08:00
Chao Yu	dbeacf02eb	f2fs: export dir operations for inline dir This patch exports some dir operations for inline dir, additionally introduces f2fs_drop_nlink from f2fs_delete_entry for reusing by inline dir function. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:31 -08:00
Chao Yu	5efd3c6f1b	f2fs: add a new mount option for inline dir Adds a new mount option 'inline_dentry' for inline dir. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:31 -08:00
Chao Yu	34d67debe0	f2fs: add infra struct and helper for inline dir This patch defines macro/inline dentry structure, and adds some helpers for inline dir infrastructure. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:31 -08:00
Jaegeuk Kim	af41d3ee00	f2fs: avoid infinite loop at cp_error This patch avoids an infinite loop in sync_dirty_inode_page when -EIO was detected. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:31 -08:00
Jaegeuk Kim	4a257ed677	f2fs: avoid build warning This patch removes build warning. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:30 -08:00
Jaegeuk Kim	13fd8f89f6	f2fs: fix to call f2fs_unlock_op This patch fixes to call f2fs_unlock_op, which was missing before. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:30 -08:00
Jaegeuk Kim	9ba69cf987	f2fs: avoid to allocate when inline_data was written The sceanrio is like this. inline_data i_size page write_begin/vm_page_mkwrite X 30 dirty_page X 30 write to #4096 position X 30 get_dnode_of_data wait for get_dnode_of_data O 30 write inline_data O 30 get_dnode_of_data O 30 reserve data block .. In this case, we have #0 = NEW_ADDR and inline_data as well. We should not allow this condition for further access. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:30 -08:00
Jaegeuk Kim	a78186ebe5	f2fs: use highmem for directory pages This patch fixes to use highmem for directory pages. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:30 -08:00
Jaegeuk Kim	1ce86bf6f8	f2fs: fix race conditon on truncation with inline_data Let's consider the following scenario. blkaddr[0] inline_data i_size i_blocks writepage truncate NEW X 4096 2 dirty page #0 NEW X 0 change i_size NEW X 0 2 f2fs_write_inline_data NEW X 0 2 get_dnode_of_data NEW X 0 2 truncate_data_blocks_range NULL O 0 1 memcpy(inline_data) NULL O 0 1 f2fs_put_dnode NULL O 0 1 f2fs_truncate NULL O 0 1 get_dnode_of_data NULL O 0 1 invalid block addr This patch adds checking inline_data flag during f2fs_truncate not to refer corrupted block indices. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:29 -08:00
Jaegeuk Kim	c08a690b46	f2fs: should truncate any allocated block for inline_data write When trying to write inline_data, we should truncate any data block allocated and pointed by the inode block. We should consider the data index is not 0. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:29 -08:00
Jaegeuk Kim	cbcb2872e3	f2fs: invalidate inmemory page If user truncates file's data, we should truncate inmemory pages too. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:29 -08:00
Jaegeuk Kim	34ba94bac9	f2fs: do not make dirty any inmemory pages This patch let inmemory pages be clean all the time. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-11-03 16:07:29 -08:00
Al Viro	ca5358ef75	deal with deadlock in d_walk() ... by not hitting rename_retry for reasons other than rename having happened. In other words, do _not_ restart when finding that between unlocking the child and locking the parent the former got into __dentry_kill(). Skip the killed siblings instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-03 15:22:16 -05:00
Al Viro	946e51f2bf	move d_rcu from overlapping d_child to overlapping d_alias Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-03 15:20:29 -05:00
Bob Peterson	1a8550332a	GFS2: If we use up our block reservation, request more next time If we run out of blocks for a given multi-block allocation, we obviously did not reserve enough. We should reserve more blocks for the next reservation to reduce fragmentation. This patch increases the size hint for reservations when they run out. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-03 19:26:54 +00:00
Bob Peterson	33ad5d5428	GFS2: Only increase rs_sizehint If an application does a sequence of (1) big write, (2) little write we don't necessarily want to reset the size hint based on the smaller size. The fact that they did any big writes implies they may do more, and therefore we should try to allocate bigger block reservations, even if the last few were small writes. Therefore this patch changes function gfs2_size_hint so that the size hint can only grow; it cannot shrink. This is especially important where there are multiple writers. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-03 19:25:41 +00:00
Bob Peterson	0e27c18c30	GFS2: Set of distributed preferences for rgrps This patch tries to use the journal numbers to evenly distribute which node prefers which resource group for block allocations. This is to help performance. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-03 19:24:49 +00:00
Fabian Frederick	37975f1503	GFS2: directly return gfs2_dir_check() No need to store gfs2_dir_check result and test it before returning. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-11-03 19:23:32 +00:00
Linus Torvalds	7e05b807b9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "A bunch of assorted fixes, most of them followups to overlayfs merge" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: initialize ->is_cursor Return short read or 0 at end of a raw device, not EIO isofs: don't bother with ->d_op for normal case isofs_cmp(): we'll never see a dentry for . or .. overlayfs: fix lockdep misannotation ovl: fix check for cursor overlayfs: barriers for opening upper-layer directory rcu: Provide counterpart to rcu_dereference() for non-RCU situations staging: android: logger: Fix log corruption regression	2014-11-02 10:28:43 -08:00
Linus Torvalds	4f4274af70	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is nailing down some problems with our skinny extent variation, and Dave's patch fixes endian problems in the new super block checks" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items Btrfs: properly clean up btrfs_end_io_wq_cache Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() btrfs: use macro accessors in superblock validation checks	2014-11-01 10:41:26 -07:00
Linus Torvalds	32e8fd2f8e	A set of miscellaneous ext4 bug fixes for 3.18. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJUVAF/AAoJENNvdpvBGATwEbAQALNiAIChEyJTnQDkAQc2wqqn dv8NQmFr5aefc63A/+n/yJJGrQZtKs0ceh29ty5ksYLFXzUdc2ctFg6vBmllQfbz PQawAk2gOkF8zfVuqiQU7X+wTBpGmGXTa8HY+WJTtk0pBfhl+p0PDCYsWXMwZJ1D tAZpxJ4AmPc7A4hApWOvce6r7Xg24vZk/8UA93Tif9AkeY6VoN272Hx5b/UGmBHY RCEgpowuiIY38bghtLh5+T0J98/EQNof46cEHgGI9nIDZeXRzgvDojE5bLI0/IS/ K07MjYlm/WFWsLFkgNJkTiqEXgnji9BNYRF1xxUjMMBAR4+fnFLw9kXXgcETrPCx U7lHOhs8M2FK40cWhUDz/tukvL4S4lQwPEeqBPlRE8J5/twRyXHeZDp4F7LOobwq mk6AajSJlP+05XwXOuCx7Hcf9uxjw/IpqhBS5IZxy8Nn3T2guPlY9wMhYU1RYFws 54FeE76SJ8EDgjVK/txj7rgh11GggWsjsdXvftSElM2DsKsqYEOKAvDzvwmbm7eV dsFOlRB6B/X4UpiAC2MiPJynYg9TJ7LkVBzDZeZ/fbm7JhTqChSJDzapqdrmNPIY SQqwLmFXnHqaw6HNitZ5Bs+fD6nfvKqy85NeImxE3lhLWDuiTt77Y3o80IW30TgN 5bnuXq8Rkukrxs/VDvPq =kI6P -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "A set of miscellaneous ext4 bug fixes for 3.18" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: make ext4_ext_convert_to_initialized() return proper number of blocks ext4: bail early when clearing inode journal flag fails ext4: bail out from make_indexed_dir() on first error jbd2: use a better hash function for the revoke table ext4: prevent bugon on race between write/fcntl ext4: remove extent status procfs files if journal load fails ext4: disallow changing journal_csum option during remount ext4: enable journal checksum when metadata checksum feature enabled ext4: fix oops when loading block bitmap failed ext4: fix overflow when updating superblock backups after resize	2014-10-31 16:22:29 -07:00
Linus Torvalds	e2488ab6ab	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and ext3 fixes from Jan Kara. * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fs, jbd: use a more generic hash function quota: Properly return errors from dquot_writeback_dquots() ext3: Don't check quota format when there are no quota files	2014-10-31 16:18:47 -07:00
Al Viro	a7400222e3	new helper: is_root_inode() replace open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-31 17:48:54 -04:00
Miklos Szeredi	ac7576f4b1	vfs: make first argument of dir_context.actor typed Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-31 17:48:54 -04:00
Miklos Szeredi	9f2f7d4c8d	ovl: initialize ->is_cursor Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-31 17:47:51 -04:00
David Jeffery	b2de525f09	Return short read or 0 at end of a raw device, not EIO Author: David Jeffery <djeffery@redhat.com> Changes to the basic direct I/O code have broken the raw driver when reading to the end of a raw device. Instead of returning a short read for a read that extends partially beyond the device's end or 0 when at the end of the device, these reads now return EIO. The raw driver needs the same end of device handling as was added for normal block devices. Using blkdev_read_iter, which has the needed size checks, prevents the EIO conditions at the end of the device. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-31 06:33:26 -04:00
Al Viro	b0afd8e5db	isofs: don't bother with ->d_op for normal case we only need it for joliet and case-insensitive mounts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-31 06:33:17 -04:00
Eric Rannaud	69a91c237a	fs: allow open(dir, O_TMPFILE\|..., 0) with mode 0 The man page for open(2) indicates that when O_CREAT is specified, the 'mode' argument applies only to future accesses to the file: Note that this mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor. The man page for open(2) implies that 'mode' is treated identically by O_CREAT and O_TMPFILE. O_TMPFILE, however, behaves differently: int fd = open("/tmp", O_TMPFILE \| O_RDWR, 0); assert(fd == -1); assert(errno == EACCES); int fd = open("/tmp", O_TMPFILE \| O_RDWR, 0600); assert(fd > 0); For O_CREAT, do_last() sets acc_mode to MAY_OPEN only: if (opened & FILE_CREATED) { / Don't check for write permission, don't truncate */ open_flag &= ~O_TRUNC; will_truncate = false; acc_mode = MAY_OPEN; path_to_nameidata(path, nd); goto finish_open_created; } But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to may_open(). This patch lines up the behavior of O_TMPFILE with O_CREAT. After the inode is created, may_open() is called with acc_mode = MAY_OPEN, in do_tmpfile(). A different, but related glibc bug revealed the discrepancy: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 The glibc lazily loads the 'mode' argument of open() and openat() using va_arg() only if O_CREAT is present in 'flags' (to support both the 2 argument and the 3 argument forms of open; same idea for openat()). However, the glibc ignores the 'mode' argument if O_TMPFILE is in 'flags'. On x86_64, for open(), it magically works anyway, as 'mode' is in RDX when entering open(), and is still in RDX on SYSCALL, which is where the kernel looks for the 3rd argument of a syscall. But openat() is not quite so lucky: 'mode' is in RCX when entering the glibc wrapper for openat(), while the kernel looks for the 4th argument of a syscall in R10. Indeed, the syscall calling convention differs from the regular calling convention in this respect on x86_64. So the kernel sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and fails with EACCES. Signed-off-by: Eric Rannaud <e@nanocritical.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-30 15:50:13 -07:00
Jan Kara	ae9e9c6aee	ext4: make ext4_ext_convert_to_initialized() return proper number of blocks ext4_ext_convert_to_initialized() can return more blocks than are actually allocated from map->m_lblk in case where initial part of the on-disk extent is zeroed out. Luckily this doesn't have serious consequences because the caller currently uses the return value only to unmap metadata buffers. Anyway this is a data corruption/exposure problem waiting to happen so fix it. Coverity-id: 1226848 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-10-30 10:53:17 -04:00
Jan Kara	4f879ca687	ext4: bail early when clearing inode journal flag fails When clearing inode journal flag, we call jbd2_journal_flush() to force all the journalled data to their final locations. Currently we ignore when this fails and continue clearing inode journal flag. This isn't a big problem because when jbd2_journal_flush() fails, journal is likely aborted anyway. But it can still lead to somewhat confusing results so rather bail out early. Coverity-id: 989044 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-10-30 10:53:17 -04:00
Jan Kara	6050d47adc	ext4: bail out from make_indexed_dir() on first error When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node() fail, there's really something wrong with the fs and there's no point in continuing further. Just return error from make_indexed_dir() in that case. Also initialize frames array so that if we return early due to error, dx_release() doesn't try to dereference uninitialized memory (which could happen also due to error in do_split()). Coverity-id: 741300 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2014-10-30 10:53:17 -04:00
Theodore Ts'o	d48458d4a7	jbd2: use a better hash function for the revoke table The old hash function didn't work well for 64-bit block numbers, and used undefined (negative) shift right behavior. Use the generic 64-bit hash function instead. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>	2014-10-30 10:53:17 -04:00
Dmitry Monakhov	a41537e69b	ext4: prevent bugon on race between write/fcntl O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org	2014-10-30 10:53:16 -04:00
Darrick J. Wong	50460fe8c6	ext4: remove extent status procfs files if journal load fails If we can't load the journal, remove the procfs files for the extent status information file to avoid leaking resources. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2014-10-30 10:53:16 -04:00
Darrick J. Wong	6b992ff256	ext4: disallow changing journal_csum option during remount ext4 does not permit changing the metadata or journal checksum feature flag while mounted. Until we decide to support that, don't allow a remount to change the journal_csum flag (right now we silently fail to change anything). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-10-30 10:53:16 -04:00
Darrick J. Wong	98c1a7593f	ext4: enable journal checksum when metadata checksum feature enabled If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2014-10-30 10:53:16 -04:00
Jan Kara	599a9b77ab	ext4: fix oops when loading block bitmap failed When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2014-10-30 10:53:16 -04:00
Jan Kara	9378c6768e	ext4: fix overflow when updating superblock backups after resize When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>	2014-10-30 10:52:57 -04:00
Joe Perches	1f33c41c03	seq_file: Rename seq_overflow() to seq_has_overflowed() and make public The return values of seq_printf/puts/putc are frequently misused. Start down a path to remove all the return value uses of these functions. Move the seq_overflow() to a global inlined function called seq_has_overflowed() that can be used by the users of seq_file() calls. Update the documentation to not show return types for seq_printf et al. Add a description of seq_has_overflowed(). Link: http://lkml.kernel.org/p/848ac7e3d1c31cddf638a8526fa3c59fa6fdeb8a.1412031505.git.joe@perches.com Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> [ Reworked the original patch from Joe ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2014-10-29 20:26:06 -04:00
Linus Torvalds	a7ca10f263	Merge branch 'akpm' (incoming from Andrew Morton) Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) mm/balloon_compaction: fix deflation when compaction is disabled sh: fix sh770x SCIF memory regions zram: avoid NULL pointer access in concurrent situation mm/slab_common: don't check for duplicate cache names ocfs2: fix d_splice_alias() return code checking mm: rmap: split out page_remove_file_rmap() mm: memcontrol: fix missed end-writeback page accounting mm: page-writeback: inline account_page_dirtied() into single caller lib/bitmap.c: fix undefined shift in __bitmap_shift_{left\|right}() drivers/rtc/rtc-bq32k.c: fix register value memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock kernel/kmod: fix use-after-free of the sub_info structure drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc mm, thp: fix collapsing of hugepages on madvise drivers: of: add return value to of_reserved_mem_device_init() mm: free compound page with correct order gcov: add ARM64 to GCOV_PROFILE_ALL fsnotify: next_i is freed during fsnotify_unmount_inodes. mm/compaction.c: avoid premature range skip in isolate_migratepages_range ...	2014-10-29 16:38:48 -07:00
Brian Foster	5d11fb4b9a	xfs: rework zero range to prevent invalid i_size updates The zero range operation is analogous to fallocate with the exception of converting the range to zeroes. E.g., it attempts to allocate zeroed blocks over the range specified by the caller. The XFS implementation kills all delalloc blocks currently over the aligned range, converts the range to allocated zero blocks (unwritten extents) and handles the partial pages at the ends of the range by sending writes through the pagecache. The current implementation suffers from several problems associated with inode size. If the aligned range covers an extending I/O, said I/O is discarded and an inode size update from a previous write never makes it to disk. Further, if an unaligned zero range extends beyond eof, the page write induced for the partial end page can itself increase the inode size, even if the zero range request is not supposed to update i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). The latter behavior not only incorrectly increases the inode size, but can lead to stray delalloc blocks on the inode. Typically, post-eof preallocation blocks are either truncated on release or inode eviction or explicitly written to by xfs_zero_eof() on natural file size extension. If the inode size increases due to zero range, however, associated blocks leak into the address space having never been converted or mapped to pagecache pages. A direct I/O to such an uncovered range cannot convert the extent via writeback and will BUG(). For example: $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> ... $ xfs_io -d -c "pread 128k 128k" <file> <BUG> If the entire delalloc extent happens to not have page coverage whatsoever (e.g., delalloc conversion couldn't find a large enough free space extent), even a full file writeback won't convert what's left of the extent and we'll assert on inode eviction. Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. Use the existing hole punch and prealloc mechanisms as primitives for zero range. This implementation is not efficient nor ideal as we writeback dirty data over the range and remove existing extents rather than convert to unwrittern. The former writeback, however, is currently the only mechanism available to ensure consistency between pagecache and extent state. Even a pagecache truncate/delalloc punch prior to hole punch has lead to inconsistencies due to racing with writeback. This provides a consistent, correct implementation of zero range that survives fsstress/fsx testing without assert failures. The implementation can be optimized from this point forward once the fundamental issue of pagecache and delalloc extent state consistency is addressed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-10-30 10:35:11 +11:00
Jan Kara	7a19dee116	xfs: Check error during inode btree iteration in xfs_bulkstat() xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In case of specific fs corruption that could result in xfs_bulkstat() entering an infinite loop because we would be looping over the same chunk over and over again. Fix the problem by checking the return value and terminating the loop properly. Coverity-id: 1231338 cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jie Liu <jeff.u.liu@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-10-30 10:34:52 +11:00
Richard Weinberger	d3556babd7	ocfs2: fix d_splice_alias() return code checking d_splice_alias() can return a valid dentry, NULL or an ERR_PTR. Currently the code checks not for ERR_PTR and will cuase an oops in ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL(). Signed-off-by: Richard Weinberger <richard@nod.at> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-29 16:33:15 -07:00
Jerry Hoemann	6424babfd6	fsnotify: next_i is freed during fsnotify_unmount_inodes. During file system stress testing on 3.10 and 3.12 based kernels, the umount command occasionally hung in fsnotify_unmount_inodes in the section of code: spin_lock(&inode->i_lock); if (inode->i_state & (I_FREEING\|I_WILL_FREE\|I_NEW)) { spin_unlock(&inode->i_lock); continue; } As this section of code holds the global inode_sb_list_lock, eventually the system hangs trying to acquire the lock. Multiple crash dumps showed: The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back at itself. As this is not the value of list upon entry to the function, the kernel never exits the loop. To help narrow down problem, the call to list_del_init in inode_sb_list_del was changed to list_del. This poisons the pointers in the i_sb_list and causes a kernel to panic if it transverse a freed inode. Subsequent stress testing paniced in fsnotify_unmount_inodes at the bottom of the list_for_each_entry_safe loop showing next_i had become free. We believe the root cause of the problem is that next_i is being freed during the window of time that the list_for_each_entry_safe loop temporarily releases inode_sb_list_lock to call fsnotify and fsnotify_inode_delete. The code in fsnotify_unmount_inodes attempts to prevent the freeing of inode and next_i by calling __iget. However, the code doesn't do the __iget call on next_i if i_count == 0 or if i_state & (I_FREEING \| I_WILL_FREE) The patch addresses this issue by advancing next_i in the above two cases until we either find a next_i which we can __iget or we reach the end of the list. This makes the handling of next_i more closely match the handling of the variable "inode." The time to reproduce the hang is highly variable (from hours to days.) We ran the stress test on a 3.10 kernel with the proposed patch for a week without failure. During list_for_each_entry_safe, next_i is becoming free causing the loop to never terminate. Advance next_i in those cases where __iget is not done. Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-29 16:33:14 -07:00
Tyler Hicks	831115af5c	eCryptfs: Remove unnecessary casts when parsing packet lengths The elements in the data array are already unsigned chars and do not need to be casted. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>	2014-10-29 18:32:59 -05:00
Linus Torvalds	d506aa68c2	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current kernel. This contains: - Two error handling fixes from Jan Kara. One for null_blk on failure to add a device, and the other for the block/scsi_ioctl SCSI_IOCTL_SEND_COMMAND fixing up the error jump point. - A commit added in the merge window for the bio integrity bits unfortunately disabled merging for all requests if CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that integrity checking wont disallow merges when not enabled. - A fix from Ming Lei for merging and generating too many segments. This caused a BUG in virtio_blk. - Two error handling printk() fixups from Robert Elliott, improving the information given when we rate limit. - Error handling fixup on elevator_init() failure from Sudip Mukherjee. - A fix from Tony Battersby, fixing up a memory leak in the scatterlist handling with scsi-mq" * 'for-linus' of git://git.kernel.dk/linux-block: block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined lib/scatterlist: fix memory leak with scsi-mq block: fix wrong error return in elevator_init() scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND null_blk: Cleanup error recovery in null_add_dev() blk-merge: recaculate segment if it isn't less than max segments fs: clarify rate limit suppressed buffer I/O errors fs: merge I/O error prints into one line	2014-10-29 11:57:10 -07:00
Al Viro	f643ff550a	isofs_cmp(): we'll never see a dentry for . or .. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-28 18:37:40 -04:00
Miklos Szeredi	d1b72cc6d8	overlayfs: fix lockdep misannotation In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-28 18:32:47 -04:00
Miklos Szeredi	c2096537d4	ovl: fix check for cursor ovl_cache_entry.name is now an array not a pointer, so it makes no sense test for it being NULL. Detected by coverity. From: Miklos Szeredi <mszeredi@suse.cz> Fixes: `68bf861107` ("overlayfs: make ovl_cache_entry->name an array instead of +pointer") Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-28 18:31:54 -04:00
Al Viro	d45f00ae43	overlayfs: barriers for opening upper-layer directory make sure that a) all stores done by opening struct file don't leak past storing the reference in od->upperfile b) the lockless side has read dependency barrier Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-28 18:27:28 -04:00
Dave Chinner	a6bbce54ef	xfs: bulkstat doesn't release AGI buffer on error The recent refactoring of the bulkstat code left a small landmine in the code. If a inobt read fails, then the tree walk is aborted and returns without releasing the AGI buffer or freeing the cursor. This can lead to a subsequent bulkstat call hanging trying to grab the AGI buffer again. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-10-29 08:22:18 +11:00
Filipe Manana	d05a2b4cd9	Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is: 1) We search in the extent tree for the skinny extent, which returns > 0 (not found); 2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it; 3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key; 4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent; 5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released out path in step 3) before doing the second search. Fix this by removing completely the path release and re-search logic. This is safe, because if we seach for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change `3173a18f70`). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-28 13:59:54 -07:00
Linus Torvalds	9f76628da2	Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux Pull two nfsd fixes from Bruce Fields: "One regression from the 3.16 xdr rewrite, one an older bug exposed by a separate bug in the client's new SEEK code" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: nfsd4: fix crash on unknown operation number nfsd4: fix response size estimation for OP_SEQUENCE	2014-10-28 13:32:06 -07:00
Peter Zijlstra	e23738a730	sched, inotify: Deal with nested sleeps inotify_read is a wait loop with sleeps in. Wait loops rely on task_struct::state and sleeps do too, since that's the only means of actually sleeping. Therefore the nested sleeps destroy the wait loop state and the wait loop breaks the sleep functions that assume TASK_RUNNING (mutex_lock). Fix this by using the new woken_wake_function and wait_woken() stuff, which registers wakeups in wait and thereby allows shrinking the task_state::state changes to the actual sleep part. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-10-28 10:55:37 +01:00
Josef Bacik	5ed5f58841	Btrfs: properly clean up btrfs_end_io_wq_cache In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-27 13:16:53 -07:00
Filipe Manana	1a4ed8fdca	Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-27 13:16:52 -07:00
David Sterba	21e7626b12	btrfs: use macro accessors in superblock validation checks The initial patch `c926093ec5` (btrfs: add more superblock checks) did not properly use the macro accessors that wrap endianness and the code would not work correctly on big endian machines. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-27 13:16:52 -07:00
Linus Torvalds	d1e14f1d63	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "overlayfs merge + leak fix for d_splice_alias() failure exits" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: overlayfs: embed middle into overlay_readdir_data overlayfs: embed root into overlay_readdir_data overlayfs: make ovl_cache_entry->name an array instead of pointer overlayfs: don't hold ->i_mutex over opening the real directory fix inode leaks on d_splice_alias() failure exits fs: limit filesystem stacking depth overlay: overlay filesystem documentation overlayfs: implement show_options overlayfs: add statfs support overlay filesystem shmem: support RENAME_WHITEOUT ext4: support RENAME_WHITEOUT vfs: add RENAME_WHITEOUT vfs: add whiteout support vfs: export check_sticky() vfs: introduce clone_private_mount() vfs: export __inode_permission() to modules vfs: export do_splice_direct() to modules vfs: add i_op->dentry_open()	2014-10-26 11:19:18 -07:00
Al Viro	db6ec212b5	overlayfs: embed middle into overlay_readdir_data same story... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-24 20:25:23 -04:00
Al Viro	49be4fb9cc	overlayfs: embed root into overlay_readdir_data no sense having it a pointer - all instances have it pointing to local variable in the same stack frame Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-24 20:25:23 -04:00
Al Viro	68bf861107	overlayfs: make ovl_cache_entry->name an array instead of pointer Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-24 20:25:22 -04:00
Al Viro	3d268c9b13	overlayfs: don't hold ->i_mutex over opening the real directory just use it to serialize the assignment Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-24 20:24:11 -04:00
Al Viro	1be47b387a	Merge branch 'overlayfs.v25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus	2014-10-23 22:52:55 -04:00
Al Viro	51486b900e	fix inode leaks on d_splice_alias() failure exits d_splice_alias() callers expect it to either stash the inode reference into a new alias, or drop the inode reference. That makes it possible to just return d_splice_alias() result from ->lookup() instance, without any extra housekeeping required. Unfortunately, that should include the failure exits. If d_splice_alias() returns an error, it leaves the dentry it has been given negative and thus it must drop the inode reference. Easily fixed, but it goes way back and will need backporting. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-10-23 22:30:18 -04:00
Miklos Szeredi	69c433ed2e	fs: limit filesystem stacking depth Add a simple read-only counter to super_block that indicates how deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:39 +02:00
Erez Zadok	f45827e841	overlayfs: implement show_options This is useful because of the stacking nature of overlayfs. Users like to find out (via /proc/mounts) which lower/upper directory were used at mount time. AV: even failing ovl_parse_opt() could've done some kstrdup() AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:38 +02:00
Andy Whitcroft	cc2596392a	overlayfs: add statfs support Add support for statfs to the overlayfs filesystem. As the upper layer is the target of all write operations assume that the space in that filesystem is the space in the overlayfs. There will be some inaccuracy as overwriting a file will copy it up and consume space we were not expecting, but it is better than nothing. Use the upper layer dentry and mount from the overlayfs root inode, passing the statfs call to that filesystem. Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:38 +02:00
Miklos Szeredi	e9be9d5e76	overlay filesystem Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer. This type of mechanism is most often used for live CDs but there's a wide variety of other uses. The implementation differs from other "union filesystem" implementations in that after a file is opened all operations go directly to the underlying, lower or upper, filesystems. This simplifies the implementation and allows native performance in these cases. The dentry tree is duplicated from the underlying filesystems, this enables fast cached lookups without adding special support into the VFS. This uses slightly more memory than union mounts, but dentries are relatively small. Currently inodes are duplicated as well, but it is a possible optimization to share inodes for non-directories. Opening non directories results in the open forwarded to the underlying filesystem. This makes the behavior very similar to union mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file descriptors). Usage: mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay The following cotributions have been folded into this patch: Neil Brown <neilb@suse.de>: - minimal remount support - use correct seek function for directories - initialise is_real before use - rename ovl_fill_cache to ovl_dir_read Felix Fietkau <nbd@openwrt.org>: - fix a deadlock in ovl_dir_read_merged - fix a deadlock in ovl_remove_whiteouts Erez Zadok <ezk@fsl.cs.sunysb.edu> - fix cleanup after WARN_ON Sedat Dilek <sedat.dilek@googlemail.com> - fix up permission to confirm to new API Robin Dong <hao.bigrat@gmail.com> - fix possible leak in ovl_new_inode - create new inode in ovl_link Andy Whitcroft <apw@canonical.com> - switch to __inode_permission() - copy up i_uid/i_gid from the underlying inode AV: - ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits - ovl_clear_empty() - one failure exit forgetting to do unlock_rename(), lack of check for udir being the parent of upper, dropping and regaining the lock on udir (which would require _another_ check for parent being right). - bogus d_drop() in copyup and rename [fix from your mail] - copyup/remove and copyup/rename races [fix from your mail] - ovl_dir_fsync() leaving ERR_PTR() in ->realfile - ovl_entry_free() is pointless - it's just a kfree_rcu() - fold ovl_do_lookup() into ovl_lookup() - manually assigning ->d_op is wrong. Just use ->s_d_op. [patches picked from Miklos]: * copyup/remove and copyup/rename races * bogus d_drop() in copyup and rename Also thanks to the following people for testing and reporting bugs: Jordi Pujol <jordipujolp@gmail.com> Andy Whitcroft <apw@canonical.com> Michal Suchanek <hramrach@centrum.cz> Felix Fietkau <nbd@openwrt.org> Erez Zadok <ezk@fsl.cs.sunysb.edu> Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:38 +02:00
Miklos Szeredi	cd808deced	ext4: support RENAME_WHITEOUT Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is created before the rename takes place. The whiteout inode is added to the old entry instead of deleting it. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:37 +02:00
Miklos Szeredi	0d7a855526	vfs: add RENAME_WHITEOUT This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a whiteout of source. The whiteout creation is atomic relative to the rename. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:37 +02:00
Miklos Szeredi	787fb6bc96	vfs: add whiteout support Whiteout isn't actually a new file type, but is represented as a char device (Linus's idea) with 0/0 device number. This has several advantages compared to introducing a new whiteout file type: - no userspace API changes (e.g. trivial to make backups of upper layer filesystem, without losing whiteouts) - no fs image format changes (you can boot an old kernel/fsck without whiteout support and things won't break) - implementation is trivial Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:36 +02:00
Miklos Szeredi	cbdf35bcb8	vfs: export check_sticky() It's already duplicated in btrfs and about to be used in overlayfs too. Move the sticky bit check to an inline helper and call the out-of-line helper only in the unlikly case of the sticky bit being set. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:36 +02:00
Miklos Szeredi	c771d683a6	vfs: introduce clone_private_mount() Overlayfs needs a private clone of the mount, so create a function for this and export to modules. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:36 +02:00
Miklos Szeredi	bd5d08569c	vfs: export __inode_permission() to modules We need to be able to check inode permissions (but not filesystem implied permissions) for stackable filesystems. Expose this interface for overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:35 +02:00
Miklos Szeredi	1c118596a7	vfs: export do_splice_direct() to modules Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:35 +02:00
Miklos Szeredi	4aa7c6346b	vfs: add i_op->dentry_open() Add a new inode operation i_op->dentry_open(). This is for stacked filesystems that want to return a struct file from a different filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2014-10-24 00:14:35 +02:00
Jeff Layton	ccc6398ea5	nfsd: clean up comments over nfs4_file definition They're a bit outdated wrt to some recent changes. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 14:05:11 -04:00
Chuck Lever	b0d2e42cce	NFSD: Always initialize cl_cb_addr A client may not want to use the back channel on a transport it sent CREATE_SESSION on, in which case it clears SESSION4_BACK_CHAN. However, cl_cb_addr should be populated anyway, to be used if the client binds other connections to this session. If cl_cb_addr is not initialized, rpc_create() fails when the server attempts to set up a back channel on such secondary transports. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 14:05:11 -04:00
Zach Brown	e77a7b4f01	nfsd: fix inclusive vfs_fsync_range() end The vfs_fsync_range() call during write processing got the end of the range off by one. The range is inclusive, not exclusive. The error has nfsd sync more data than requested -- it's correct but unnecessary overhead. The call during commit processing is correct so I copied that pattern in write processing. Maybe a helper would be nice but I kept it trivial. This is untested. I found it while reviewing code for something else entirely. Signed-off-by: Zach Brown <zab@zabbo.net> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 14:05:10 -04:00
J. Bruce Fields	51904b0807	nfsd4: fix crash on unknown operation number Unknown operation numbers are caught in nfsd4_decode_compound() which sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The error causes the main loop in nfsd4_proc_compound() to skip most processing. But nfsd4_proc_compound also peeks ahead at the next operation in one case and doesn't take similar precautions there. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 13:39:51 -04:00
Tyler Hicks	332b122d39	eCryptfs: Force RO mount when encrypted view is enabled The ecryptfs_encrypted_view mount option greatly changes the functionality of an eCryptfs mount. Instead of encrypting and decrypting lower files, it provides a unified view of the encrypted files in the lower filesystem. The presence of the ecryptfs_encrypted_view mount option is intended to force a read-only mount and modifying files is not supported when the feature is in use. See the following commit for more information: `e77a56d` [PATCH] eCryptfs: Encrypted passthrough This patch forces the mount to be read-only when the ecryptfs_encrypted_view mount option is specified by setting the MS_RDONLY flag on the superblock. Additionally, this patch removes some broken logic in ecryptfs_open() that attempted to prevent modifications of files when the encrypted view feature was in use. The check in ecryptfs_open() was not sufficient to prevent file modifications using system calls that do not operate on a file descriptor. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Priya Bansal <p.bansal@samsung.com> Cc: stable@vger.kernel.org # v2.6.21+: `e77a56d` [PATCH] eCryptfs: Encrypted passthrough	2014-10-23 09:11:03 -04:00
Fabian Frederick	7f4028b296	jffs2: fix sparse warning: unexpected unlock fs/jffs2/summary.c:846:5: warning: context imbalance in 'jffs2_sum_write_sumnode' - unexpected unlock Suggested-by: Brian Norris <computersforpeace@gmail.com> Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Brian Norris <computersforpeace@gmail.com>	2014-10-22 01:35:41 -07:00
Sasha Levin	3c9cafe05f	fs, jbd: use a more generic hash function While the hash function used by the revoke hashtable is good somewhere else, it's not really good here. The default hash shift (8) means that one third of the hashing function gets lost (and is undefined anyways (8 - 12 = negative shift)): "(block << (hash_shift - 12))) & (table->hash_size - 1)" Instead, just use the kernel's generic hash function that gets used everywhere else. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>	2014-10-22 10:02:04 +02:00
Jan Kara	474d2605d1	quota: Properly return errors from dquot_writeback_dquots() Due to a switched left and right side of an assignment, dquot_writeback_dquots() never returned error. This could result in errors during quota writeback to not be reported to userspace properly. Fix it. CC: stable@vger.kernel.org Coverity-id: 1226884 Signed-off-by: Jan Kara <jack@suse.cz>	2014-10-22 09:08:03 +02:00
Jan Kara	7938db449b	ext3: Don't check quota format when there are no quota files The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2014-10-22 09:02:48 +02:00
Robert Elliott	432f16e64f	fs: clarify rate limit suppressed buffer I/O errors When quiet_error applies rate limiting to buffer_io_error calls, what the they apply to is unclear because the name is so generic, particularly if the messages are interleaved with others: [ 1936.063572] quiet_error: 664293 callbacks suppressed [ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write [ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write Also, the function uses printk_ratelimit(), although printk.h includes a comment advising "Please don't use... Instead use printk_ratelimited()." Change buffer_io_error to check the BH_Quiet bit itself, drop the printk_ratelimit call, and print using printk_ratelimited. This makes the messages look like: [ 387.208839] buffer_io_error: 676394 callbacks suppressed [ 387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write [ 387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 13:55:11 -06:00
Robert Elliott	b744c2ac4b	fs: merge I/O error prints into one line buffer.c uses two printk calls to print these messages: [67353.422338] Buffer I/O error on device sdr, logical block 212868488 [67353.422338] lost page write due to I/O error on sdr In a busy system, they may be interleaved with other prints, losing the context for the second message. Merge them into one line with one printk call so the prints are atomic. Also, differentiate between async page writes, sync page writes, and async page reads. Also, shorten "device" to "dev" to match the block layer prints: [67353.467906] blk_update_request: critical target error, dev sdr, sector 1707107328 Also, use %llu rather than %Lu. Resulting prints look like: [ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992 [ 1361.383522] quiet_error: 659876 callbacks suppressed [ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write [ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 13:55:09 -06:00
Linus Torvalds	848a552893	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd Pull email address change from Boaz Harrosh. * 'for-linus' of git://git.open-osd.org/linux-open-osd: Boaz Harrosh - fix email in Documentation Boaz Harrosh - Fix broken email address MAINTAINERS: Change Boaz Harrosh's email	2014-10-21 12:53:45 -07:00
J. Bruce Fields	d1d84c9626	nfsd4: fix response size estimation for OP_SEQUENCE We added this new estimator function but forgot to hook it up. The effect is that NFSv4.1 (and greater) won't do zero-copy reads. The estimate was also wrong by 8 bytes. Fixes: `ccae70a9ee` "nfsd4: estimate sequence response size" Cc: stable@vger.kernel.org Reported-by: Chuck Lever <chucklever@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-21 09:10:50 -04:00
Linus Torvalds	c2661b8060	A large number of cleanups and bug fixes, with some (minor) journal optimizations. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN 1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z 8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0 0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB 9g4GKAXAyh15PeBTJ5K/ =vWic -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with some (minor) journal optimizations" [ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ] * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits) ext4: check s_chksum_driver when looking for bg csum presence ext4: move error report out of atomic context in ext4_init_block_bitmap() ext4: Replace open coded mdata csum feature to helper function ext4: delete useless comments about ext4_move_extents ext4: fix reservation overflow in ext4_da_write_begin ext4: add ext4_iget_normal() which is to be used for dir tree lookups ext4: don't orphan or truncate the boot loader inode ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT ext4: optimize block allocation on grow indepth ext4: get rid of code duplication ext4: fix over-defensive complaint after journal abort ext4: fix return value of ext4_do_update_inode ext4: fix mmap data corruption when blocksize < pagesize vfs: fix data corruption when blocksize < pagesize for mmaped data ext4: fold ext4_nojournal_sops into ext4_sops ext4: support freezing ext2 (nojournal) file systems ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs() ext4: don't check quota format when there are no quota files jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list jbd2: avoid pointless scanning of checkpoint lists ...	2014-10-20 09:50:11 -07:00
Wolfram Sang	75c43e0493	pstore: drop owner assignment from platform_drivers A platform_driver does not need to set an owner, it will be populated by the driver core. Signed-off-by: Wolfram Sang <wsa@the-dreams.de>	2014-10-20 16:21:57 +02:00
Boaz Harrosh	aa281ac631	Boaz Harrosh - Fix broken email address I no longer have access to the Panasas email. So change to an email that can always reach me. Signed-off-by: Boaz Harrosh <ooo@electrozaur.com>	2014-10-19 20:22:32 +03:00
Linus Torvalds	9272f2dc39	Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs/smb3 updates from Steve French: "Improved SMB3 support (symlink and device emulation, and remapping by default the 7 reserved posix characters) and a workaround for cifs mounts to Mac (working around a commonly encountered Mac server bug)" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: [CIFS] Remove obsolete comment Check minimum response length on query_network_interface Workaround Mac server problem Remap reserved posix characters by default (part 3/3) Allow conversion of characters in Mac remap range (part 2) Allow conversion of characters in Mac remap range. Part 1 mfsymlinks support for SMB2.1/SMB3. Part 2 query symlink Add mfsymlinks support for SMB2.1/SMB3. Part 1 create symlink Allow mknod and mkfifo on SMB2/SMB3 mounts add defines for two new file attributes	2014-10-18 13:39:19 -07:00
Linus Torvalds	e83e432372	dlm for 3.18 This includes a single commit fixing a missing endian conversion. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUQT+qAAoJEDgbc8f8gGmqT1wQAMjq4O0wh0DNI08SrHHszwnI e5btCmlMfNIvSfKmNcwstEJH5R86qi+8Q7fAKPj73XXlb1HLYjhQltuuysRSlmZE Ll+eqnOWVJPxlFjyRDKzO7HxejMsqEE29gtmzw9iL43F2M1acmCMgrVLz8VJGHH2 21hK9eo9L51kLVXK2+rSexdJJNDVjuy9N25J1d/r9/9UEIp+gGhD5ZMpBlDmw1md VJgbL/yv99mjicjVnd5nZE7fhcWFPPV5IyeyUQS0Pg92cAYZUNBBAzta3yU4ira4 tJU4V2k3zoK40TXqRnGNFgFNvm8cfzNUX+9HtWyWmr6VySfRVXFhSDcHa2/NQva9 4T1vVO5Dg/4+ELIdtaGMEv+B5KCv9/hWjmhnlLBIJktJdH+1NmRXjZAg6gM51XRB ZwQIf929lELeqGTdfJyO4bR/NPcjrEnEpQFBBAuypko06ZJhXLJZqtKD/3o5lCjX VOgPkuNtylPARdUo/m1EffF4Rm1LREcbMHDMUInaVt6qAeHRqq8QkvBhBQ7VzFX1 Y1dwGWh8m9pA3q48j2GIT1rWbl5Fue8f34RMnn4psRVM0rnQ8WHh3/p5wsSsXlib BCX5ez5O2aF7pe4Lqj3NkGmyB6PvICRbU2ZEe2uPbvMZnoq0NouO4asCLSeycu52 fYlz9mZ3e+Y+sv2JN3vG =eEi+ -----END PGP SIGNATURE----- Merge tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm fix from David Teigland: "This includes a single commit fixing a missing endian conversion" * tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: fix missing endian conversion of rcom_status flags	2014-10-18 13:37:19 -07:00
Linus Torvalds	ef161ea1ff	Merge branch 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs data corruption fix from Chris Mason: "I'm testing a pull with more fixes, but wanted to get this one out so Greg can pick it up. The corruption isn't easy to hit, you have to do a readonly snapshot and have orphans in the snapshot. But my review and testing missed the bug. Filipe has added a better xfstest to cover it" * 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Revert "Btrfs: race free update of commit root for ro snapshots"	2014-10-18 13:32:17 -07:00
Linus Torvalds	8ccf863f09	ensure unique filenames in pstore -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUP/b6AAoJEKurIx+X31iBnikP/3yUNHty6FpLJ/u0liqnzBIP DgD4BR+EwObVdMG4a4M5QcR9euoNdg532ywyoyWtRKpX7czpicu+Q60uHGx5OKwd fCMVibMTv+kQ/r+UMQ/J9+B3CyZXwRzHOMlcfT/yxZdYG+o+LZkfVBfY5kn9DQFK I+c3LZLwqOJhuz8/yUO9azFSo+r+QvtoC/WApW28iAKXjtCss9Vi2NN9i4yEqsQ2 BelbuiiI6DYYVP7yBWIIJJlUO9xV/JSJx8tfA0Qyp5JYaRApzZ6kZpn1vnlB3JJ3 prkWxqy9RkQQIcsUn/tCxiX8TBoXyNR4Vb8722ACGHIUBQp+kzyRi/9xfeZoi8E3 2PXRIX6QJSy0++YSX6wVVbYSq3DrfQoFByNwR1uvtFxIzzMoRmNuvIwWlHe/Yew5 QQ8U1UiYnBm/I8SH0GCd/vbwB1Ar9QV3JhbgUg6tzFurj6EBRBZqAz9UYThROarZ LSPJ8l+D2x0vkRaFEekO0C+OMXvv3if82cqcYdx042KM07RpjN71ceoY2PZIUJK4 jUJKAfA2h8ymGTTG/tWNiUBkjnPXkMBXMi8YWjYHJ2QdYJauagwbS6jRgPJ3eZ/I muVezr9vHLnIxSv0yLmcLkBepACQgjA0mIXn24XV0W/5/5Uz/BQyAUoz2h71m1ak bxeX5HWwp0/p0kkvSOgk =mFre -----END PGP SIGNATURE----- Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore fix from Tony Luck: "Ensure unique filenames in pstore" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Fix duplicate {console,ftrace}-efi entries	2014-10-18 13:25:03 -07:00
Linus Torvalds	4869447d21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/aia21/ntfs Pull NTFS update from Anton Altaparmakov: "Here is a small NTFS update notably implementing FIBMAP ioctl for NTFS by adding the bmap address space operation. People seem to still want FIBMAP" * git://git.kernel.org/pub/scm/linux/kernel/git/aia21/ntfs: NTFS: Bump version to 2.1.31. NTFS: Add bmap address space operation needed for FIBMAP ioctl. NTFS: Remove changelog from Documentation/filesystems/ntfs.txt. NTFS: Split ntfs_aops into ntfs_normal_aops and ntfs_compressed_aops in preparation for them diverging.	2014-10-18 12:54:46 -07:00
Linus Torvalds	ead13aee23	NFS client updates for Linux 3.18 Highlights include: Stable fixes: - Fix an uninitialised pointer Oops in the writeback error path - Fix a bogus warning (and early exit from the loop) in nfs_generic_pgio Features: - Add NFSv4.2 SEEK feature and client support for lseek(SEEK_HOLE/SEEK_DATA) Other fixes: - pnfs: replace broken pnfs_put_lseg_async - Remove dead prototype for nfs4_insert_deviceid_node -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUP4dhAAoJEGcL54qWCgDy1+EQAKpFVKDjGEik5V1ye492mGbS moP3zvJtGTfwoK+eCV4HqIj0oFh06RqoQc5z4vc0MKd3uuxXgPFj/wmJef48CWot 1BKoDWItzZZlBAJsE0+0i6EWZ05tAZv7Fr9ial6VwchpNAih1K/NPcBix/jEwWYl WEFhKyGiN6UNhWfL0nXgvyQPD5tevvkr+5qv0A8dm8HGTEBlGhbzt4sfiUbWHPMD jCVsHBZNOeZQrFwVzMx+LCZaBwb+xeRAs0Pg1U51MetpwnNce9737M9oetbzgkI5 cSACmvo76osaerC2Avxj4t713xeqfur4YRwIajamkUxWbcjXm3vaAQmE744meJOO IOkIrmiGfMnJHw3f5CFnsAXcXmGu16umRqTexzz5iCqZ65ZuKWgIxs9yzBCOK/Or f8nbGsCU2EaX8mJxf+esP4SzstxtqHRtkTj4f8D4wWm/W+PXYPCzCsI1q3rDni0U yNH2Z4gRWcGz6LVfL/9lDHldN5S4WLkJuZQJtyijcy1jYCjwEEUFbX1V41s85GFB QdQvZ7Cr4NqRyqvR7zyIdH1jTWt2N5mJytsmEbSGoXgBFzRDYCFXfSNjjJ2JfThl sGeWlP99uEVbTxCTTBbulFXqYrmiNN7mW61+qedqT2Oy5/YjJTSE4PCp1p2YWlx/ /c17q7jOq6cnbKcyu7cl =ekPm -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - fix an uninitialised pointer Oops in the writeback error path - fix a bogus warning (and early exit from the loop) in nfs_generic_pgio() Features: - Add NFSv4.2 SEEK feature and client support for lseek(SEEK_HOLE/SEEK_DATA) Other fixes: - pnfs: replace broken pnfs_put_lseg_async - Remove dead prototype for nfs4_insert_deviceid_node" * tag 'nfs-for-3.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a bogus warning in nfs_generic_pgio NFS: Fix an uninitialised pointer Oops in the writeback error path NFSv4.1/pnfs: replace broken pnfs_put_lseg_async NFSv4: Remove dead prototype for nfs4_insert_deviceid_node() NFS: Implement SEEK	2014-10-18 12:52:08 -07:00
Linus Torvalds	d3dc366bba	Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...	2014-10-18 11:53:51 -07:00
Steve French	ff273cb879	[CIFS] Remove obsolete comment Signed-off-by: Steven French <smfrench@gmail.com>	2014-10-17 17:17:12 -05:00
Chris Mason	d37973082b	Revert "Btrfs: race free update of commit root for ro snapshots" This reverts commit `9c3b306e1c`. Switching only one commit root during a transaction is wrong because it leads the fs into an inconsistent state. All commit roots should be switched at once, at transaction commit time, otherwise backref walking can often miss important references that were only accessible through the old commit root. Plus, the root item for the snapshot's root wasn't getting updated and preventing the next transaction commit to do it. This made several users get into random corruption issues after creation of readonly snapshots. A regression test for xfstests will follow soon. Cc: stable@vger.kernel.org # 3.17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-17 02:40:59 -07:00
Steve French	9ffc541296	Check minimum response length on query_network_interface Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <smfrench@gmail.com>	2014-10-16 15:20:20 -05:00
Steve French	b5b374eab1	Workaround Mac server problem Mac server returns that they support CIFS Unix Extensions but doesn't actually support QUERY_FILE_UNIX_BASIC so mount fails. Workaround this problem by disabling use of Unix CIFS protocol extensions if server returns an EOPNOTSUPP error on QUERY_FILE_UNIX_BASIC during mount. Signed-off-by: Steve French <smfrench@gmail.com>	2014-10-16 15:20:20 -05:00
Steve French	2baa268253	Remap reserved posix characters by default (part 3/3) This is a bigger patch, but its size is mostly due to a single change for how we check for remapping illegal characters in file names - a lot of repeated, small changes to the way callers request converting file names. The final patch in the series does the following: 1) changes default behavior for cifs to be more intuitive. Currently we do not map by default to seven reserved characters, ie those valid in POSIX but not in NTFS/CIFS/SMB3/Windows, unless a mount option (mapchars) is specified. Change this to by default always map and map using the SFM maping (like the Mac uses) unless the server negotiates the CIFS Unix Extensions (like Samba does when mounting with the cifs protocol) when the remapping of the characters is unnecessary. This should help SMB3 mounts in particular since Samba will likely be able to implement this mapping with its new "vfs_fruit" module as it will be doing for the Mac. 2) if the user specifies the existing "mapchars" mount option then use the "SFU" (Microsoft Services for Unix, SUA) style mapping of the seven characters instead. 3) if the user specifies "nomapposix" then disable SFM/MAC style mapping (so no character remapping would be used unless the user specifies "mapchars" on mount as well, as above). 4) change all the places in the code that check for the superblock flag on the mount which is set by mapchars and passed in on all path based operation and change it to use a small function call instead to set the mapping type properly (and check for the mapping type in the cifs unicode functions) Signed-off-by: Steve French <smfrench@gmail.com>	2014-10-16 15:20:20 -05:00
Steve French	a4153cb1d3	Allow conversion of characters in Mac remap range (part 2) The previous patch allowed remapping reserved characters from directory listenings, this patch adds conversion the other direction, allowing opening of files with any of the seven reserved characters. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>	2014-10-16 15:20:20 -05:00
Steve French	b693855fe6	Allow conversion of characters in Mac remap range. Part 1 This allows directory listings to Mac to display filenames correctly which have been created with illegal (to Windows) characters in their filename. It does not allow converting the other direction yet ie opening files with these characters (followon patch). There are seven reserved characters that need to be remapped when mounting to Windows, Mac (or any server without Unix Extensions) which are valid in POSIX but not in the other OS. : \ < > ? * \| We used the normal UCS-2 remap range for this in order to convert this to/from UTF8 as did Windows Services for Unix (basically add 0xF000 to any of the 7 reserved characters), at least when the "mapchars" mount option was specified. Mac used a very slightly different "Services for Mac" remap range 0xF021 through 0xF027. The attached patch allows cifs.ko (the kernel client) to read directories on macs containing files with these characters and display their names properly. In theory this even might be useful on mounts to Samba when the vfs_catia or new "vfs_fruit" module is loaded. Currently the 7 reserved characters look very strange in directory listings from cifs.ko to Mac server. This patch allows these file name characters to be read (requires specifying mapchars on mount). Two additional changes are needed: 1) Make it more automatic: a way of detecting enough info so that we know to try to always remap these characters or not. Various have suggested that the SFM approach be made the default when the server does not support POSIX Unix extensions (cifs mounts to Samba for example) so need to make SFM remapping the default unless mapchars (SFU style mapping) specified on mount or no mapping explicitly requested or no mapping needed (cifs mounts to Samba). 2) Adding a patch to map the characters the other direction (ie UTF-8 to UCS-2 on open). This patch does it for translating readdir entries (ie UCS-2 to UTF-8) Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>	2014-10-16 15:20:20 -05:00
Steve French	c22870ea2d	mfsymlinks support for SMB2.1/SMB3. Part 2 query symlink Adds support on SMB2.1 and SMB3 mounts for emulation of symlinks via the "Minshall/French" symlink format already used for cifs mounts when mfsymlinks mount option is used (and also used by Apple). http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks This second patch adds support to query them (recognize them as symlinks and read them). Third version of patch makes minor corrections to error handling. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Stefan Metzmacher <metze@samba.org>	2014-10-16 15:20:20 -05:00
Steve French	5ab97578cb	Add mfsymlinks support for SMB2.1/SMB3. Part 1 create symlink Adds support on SMB2.1 and SMB3 mounts for emulation of symlinks via the "Minshall/French" symlink format already used for cifs mounts when mfsymlinks mount option is used (and also used by Apple). http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks This first patch adds support to create them. The next patch will add support for recognizing them and reading them. Although CIFS/SMB3 have other types of symlinks, in the many use cases they aren't practical (e.g. either require cifs only mounts with unix extensions to Samba, or require the user to be Administrator to Windows for SMB3). This also helps enable running additional xfstests over SMB3 (since some xfstests directly or indirectly require symlink support). Signed-off-by: Steve French <smfrench@gmail.com> CC: Stefan Metzmacher <metze@samba.org>	2014-10-16 15:20:20 -05:00
Steve French	db8b631d4b	Allow mknod and mkfifo on SMB2/SMB3 mounts The "sfu" mount option did not work on SMB2/SMB3 mounts. With these changes when the "sfu" mount option is passed in on an smb2/smb2.1/smb3 mount the client can emulate (and recognize) fifo and device (character and device files). In addition the "sfu" mount option should not conflict with "mfsymlinks" (symlink emulation) as we will never create "sfu" style symlinks, but using "sfu" mount option will allow us to recognize existing symlinks, created with Microsoft "Services for Unix" (SFU and SUA). To enable the "sfu" mount option for SMB2/SMB3 the calling syntax of the generic cifs/smb2/smb3 sync_read and sync_write protocol dependent function needed to be changed (we don't have a file struct in all cases), but this actually ended up simplifying the code a little. Signed-off-by: Steve French <smfrench@gmail.com>	2014-10-16 15:20:19 -05:00
Steve French	7332297909	add defines for two new file attributes Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>	2014-10-16 15:20:19 -05:00
Anton Altaparmakov	3569b70c40	NTFS: Bump version to 2.1.31. Signed-off-by: Anton Altaparmakov <anton@tuxera.com>	2014-10-16 12:53:35 +01:00
Anton Altaparmakov	3f7fc6f2a2	NTFS: Add bmap address space operation needed for FIBMAP ioctl. Signed-off-by: Anton Altaparmakov <anton@tuxera.com>	2014-10-16 12:50:52 +01:00
Anton Altaparmakov	ce1bafa094	NTFS: Split ntfs_aops into ntfs_normal_aops and ntfs_compressed_aops in preparation for them diverging. Signed-off-by: Anton Altaparmakov <anton@tuxera.com>	2014-10-16 12:28:03 +01:00
Valdis Kletnieks	d4bf205da6	pstore: Fix duplicate {console,ftrace}-efi entries The pstore filesystem still creates duplicate filename/inode pairs for some pstore types. Add the id to the filename to prevent that. Before patch: [/sys/fs/pstore] ls -li total 0 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi 1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi After: [/sys/fs/pstore] ls -li total 0 1232 -r--r--r--. 1 root root 148 Sep 29 17:09 console-efi-141202499100000 1231 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi-141202499200000 1230 -r--r--r--. 1 root root 148 Sep 29 17:44 console-efi-141202705400000 1229 -r--r--r--. 1 root root 67 Sep 29 17:44 console-efi-141202705500000 1228 -r--r--r--. 1 root root 67 Sep 29 20:42 console-efi-141203772600000 1227 -r--r--r--. 1 root root 148 Sep 29 23:42 console-efi-141204854900000 1226 -r--r--r--. 1 root root 67 Sep 29 23:42 console-efi-141204855000000 1225 -r--r--r--. 1 root root 148 Sep 29 23:59 console-efi-141204954200000 1224 -r--r--r--. 1 root root 67 Sep 29 23:59 console-efi-141204954400000 Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org # 3.6+ Signed-off-by: Tony Luck <tony.luck@intel.com>	2014-10-15 13:51:33 -07:00
Linus Torvalds	0429fbc0bd	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...	2014-10-15 07:48:18 +02:00
Linus Torvalds	6929c35897	LLVMLinux patches for v3.18 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUPOQPAAoJEHKvRublQaWiZnQP/0bDfPvhaNKZtmDKYzbqQm9x Gh3fB53d7TtIei///5eD/LUGObg0Ze8g2k8aAC05hcq5Be8Haitma3iDDSvJq6ig umYU9BWkv47BHQy0gAVMXyaxHocICq7W9bnOvU4SSEW/sWeneBllyxgCfV7EKSTd B/OXP+Ovr4FtjAweOACCN8b0M9w4wdxNKfDNV+WDIHAddYBngxrwq7zASKNNCx2N 0u9sXIdTp0Fvxmyx/lYLC5NXN0CUDjB3Ffdx3+eehrBp2lT6JdlkYU403c85cIMP oIPJVKFbOnM04MfCjhFNTrK9OtC2eD6PoWq+FLL0UJx3YW5HkbLsiHGm/2UTJ2Z1 5QOwDebMxlvrb6f6Gv846ADl7YcByiXkieTDHRlOnplVRNV5Sj8UWAgq+zyq7sWq 2uRuW2UvKx7vYoAwKRwCaIoqpIe3NIvZQzE7C9mGOprIawZ5e0YJzaR6OoBs4Y8i gmBeFx266URJun7isy1R7JJsMjYzxbEXju9zH/SkghbLHnf8yqafIG+pG1GD7n5R o2C/5TVXjmEhIoDn8j2ZozaElX4mD7REKILpIGaE4XltExNTexq8neRo9D3ajzif N5RrMCAkBzIxMz83evDppe3ObtkEaf0K43VCO3AVQ2g9jXg7ttKhR2hb8HRqCMHe Lp3d8qKyZyL/HZ5F62yX =zJyI -----END PGP SIGNATURE----- Merge tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel Pull LLVM updates from Behan Webster: "These patches remove the use of VLAIS using a new SHASH_DESC_ON_STACK macro. Some of the previously accepted VLAIS removal patches haven't used this macro. I will push new patches to consistently use this macro in all those older cases for 3.19" [ More LLVM patches coming in through subsystem trees, and LLVM itself needs some fixes that are already in many distributions but not in released versions of LLVM. Some day this will all "just work" - Linus ] * tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel: crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c btrfs: LLVMLinux: Remove VLAIS crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code	2014-10-15 07:30:52 +02:00
Linus Torvalds	6b04908166	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is the long-awaited discard support for RBD (Guangliang Zhao, Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and performance and debugging improvements (Yan, Zheng, John Spray), and a smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) ceph: fix divide-by-zero in __validate_layout() rbd: rbd workqueues need a resque worker libceph: ceph-msgr workqueue needs a resque worker ceph: fix bool assignments libceph: separate multiple ops with commas in debugfs output libceph: sync osd op definitions in rados.h libceph: remove redundant declaration ceph: additional debugfs output ceph: export ceph_session_state_name function ceph: include the initial ACL in create/mkdir/mknod MDS requests ceph: use pagelist to present MDS request data libceph: reference counting pagelist ceph: fix llistxattr on symlink ceph: send client metadata to MDS ceph: remove redundant code for max file size verification ceph: remove redundant io_iter_advance() ceph: move ceph_find_inode() outside the s_mutex ceph: request xattrs if xattr_version is zero rbd: set the remaining discard properties to enable support rbd: use helpers to handle discard for layered images correctly ...	2014-10-15 06:46:01 +02:00
Linus Torvalds	ce9d7f7b45	Merge branch 'CVE-2014-7970' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux Pull pivot_root() fix from Andy Lutomirski. Prevent a leak of unreachable mounts. * 'CVE-2014-7970' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux: mnt: Prevent pivot_root from creating a loop in the mount tree	2014-10-15 06:43:27 +02:00
Eric W. Biederman	0d0826019e	mnt: Prevent pivot_root from creating a loop in the mount tree Andy Lutomirski recently demonstrated that when chroot is used to set the root path below the path for the new ``root'' passed to pivot_root the pivot_root system call succeeds and leaks mounts. In examining the code I see that starting with a new root that is below the current root in the mount tree will result in a loop in the mount tree after the mounts are detached and then reattached to one another. Resulting in all kinds of ugliness including a leak of that mounts involved in the leak of the mount loop. Prevent this problem by ensuring that the new mount is reachable from the current root of the mount tree. [Added stable cc. Fixes CVE-2014-7970. --Andy] Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Andy Lutomirski <luto@amacapital.net> Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net>	2014-10-14 14:27:19 -07:00
Neale Ferguson	c07127b48c	dlm: fix missing endian conversion of rcom_status flags The flags are already converted to le when being sent, but are not being converted back to cpu when received. Signed-off-by: Neale Ferguson <neale@sinenomine.net> Signed-off-by: David Teigland <teigland@redhat.com>	2014-10-14 15:11:48 -05:00
Yan, Zheng	0bc62284ee	ceph: fix divide-by-zero in __validate_layout() The 'stripe_unit' field is 64 bits, casting it to 32 bits can result zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>	2014-10-14 12:57:05 -07:00
Fabian Frederick	ab6c2c3ebe	ceph: fix bool assignments Fix some coccinelle warnings: fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-10-14 12:57:04 -07:00

... 5 6 7 8 9 ...

38950 Commits