OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Paul Gortmaker	c013d5a458	fs/notify: don't use module_init for non-modular inotify_user code The INOTIFY_USER option is bool, and hence this code is either present or absent. It will never be modular, so using module_init as an alias for __initcall is rather misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of fs_initcall (which makes sense for fs code) will thus change this registration from level 6-device to level 5-fs (i.e. slightly earlier). However no observable impact of that small difference has been observed during testing, or is expected. Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2015-06-16 14:12:34 -04:00
Trond Myklebust	3438995bc4	NFS: NFSoRDMA Client Changes These patches continue to build up for improving the rsize and wsize that the NFS client uses when talking over RDMA. In addition, these patches also add in scalability enhancements and other bugfixes. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVey8qAAoJENfLVL+wpUDroT4P/3lwspXwdxS6VZWsW1VpNtdV V1KKd5D+TkpBpz/ih9GdOVaBZijaHpb6XtMReh8xuh0KI893iYmsmLoyhMTPJMsU 6sjUDEv8IFrXwlRKldX1KEfBvNgR0czCNiha6O+YsV5Az08+zr57ahyGKmLUzMxo 4XzPZbwnb5fxvgmBgENUU33g+xXGsXDbsdzLvKW3UGcPU2x6PGOTLr5vP7lQkwxE 20d9ak8xQeRUk0hsmRM4fAebzcluD1o3PLIFQBEhh0Gqm1VGtSCkr9o493gT5TgM /+XrU7B8OnbdJ1B4f/y4Bz4RucfKzyRuXMpulrnK1hL7QIiZLqiph7UrTel/ajcD 0us9PImNwXPo8tMz7Wjw2XMQplndHB3FG3M3lXlJGHlXvCI7F0yjm21AP4SeetOm kxL24Qiyi7l/S7JJxHqNlOc0b8kpVLohBZm6yee9w4r/JUPnynUqfnXCHLjIp/5W F1hzbCUATyfKrSs7VKO0hCQHfntigPEhRmyfoyXRAXzl5LnR1XqD6Wah3a3pwXn+ mEquUd6fKRHIIvJ8cKU6KtykkhRHg1sR/z1mw2ZEW/2PCd0cb+8+WN7X/fQqEN+u +VQSo7oPp38SHdsyozuUUyukN5qHptTMSrNZL+LI7J8/0+BuRuIvW0nojViapc51 LOUlcgqRdUlIvmn754Yo =N1tO -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma NFS: NFSoRDMA Client Changes These patches continue to build up for improving the rsize and wsize that the NFS client uses when talking over RDMA. In addition, these patches also add in scalability enhancements and other bugfixes. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (142 commits) xprtrdma: Reduce per-transport MR allocation xprtrdma: Stack relief in fmr_op_map() xprtrdma: Split rb_lock xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy xprtrdma: Remove ->ro_reset xprtrdma: Remove unused LOCAL_INV recovery logic xprtrdma: Acquire MRs in rpcrdma_register_external() xprtrdma: Introduce an FRMR recovery workqueue xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external() xprtrdma: Introduce helpers for allocating MWs xprtrdma: Use ib_device pointer safely xprtrdma: Remove rr_func xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt xprtrdma: Warn when there are orphaned IB objects ...	2015-06-16 11:37:50 -04:00
Trond Myklebust	5ba12443a1	NFSv4: Fix stateid recovery on revoked delegations Ensure that we fix the non-NULL stateid case as well. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:29:51 -04:00
Olga Kornievskaia	ae2ffef383	Recover from stateid-type error on SETATTR Client can receives stateid-type error (eg., BAD_STATEID) on SETATTR when delegation stateid was used. When no open state exists, in case of application calling truncate() on the file, client has no state to recover and fails with EIO. Instead, upon such error, return the bad delegation and then resend the SETATTR with a zero stateid. Signed-off: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:29:46 -04:00
Kinglong Mee	df05a49f72	nfs: Fix showing truncated fsid/dev in, /proc/net/nfsfs/volumes A truncated fsid showing from /proc/fs/nfsfs/volumes as, NV SERVER PORT DEV FSID FSC v4 c0a80881 801 0:43 34931f044c2a439b no It should be as, NV SERVER PORT DEV FSID FSC v4 c0a80881 801 0:43 34931f044c2a439b:954c5d830fa4be8c no The max buffer length for storing "%llx:%llx" format should be 16 + 1 + 16 + 1 = 34 (16 for %llx, 1 for ':', 1 for '\0'). Also, for storing "%u:%u" of MAJOR() and MINOR() should be 8 + 1 + 3 + 1 = 13 (8 for 2^24, 1 for ':', 3 for 2^8, 1 for '\0'). v2, add comments for dev/fsid buffer and use sizeof in snprintf. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:17:37 -04:00
Jeff Layton	873e385116	nfs: make nfs4_init_uniform_client_string use a dynamically allocated buffer Change the uniform client string generator to dynamically allocate the NFSv4 client name string buffer. With this patch, we can eliminate the buffers that are embedded within the "args" structs and simply use the name string that is hanging off the client. This uniform string case is a little simpler than the nonuniform since we don't need to deal with RCU, but we do have two different cases, depending on whether there is a uniquifier or not. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:15:51 -04:00
Jeff Layton	a319268891	nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer The way the *_client_string functions work is a little goofy. They build the string in an on-stack buffer and then use kstrdup to copy it. This is not only stack-heavy but artificially limits the size of the client name string. Change it so that we determine the length of the string, allocate it and then scnprintf into it. Since the contents of the nonuniform string depend on rcu-managed data structures, it's possible that they'll change between when we allocate the string and when we go to fill it. If that happens, free the string, recalculate the length and try again. If it the mismatch isn't resolved on the second try then just give up and return -EINVAL. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:15:45 -04:00
Jeff Layton	b8fb2f595e	nfs: update maxsz values for SETCLIENTID and EXCHANGE_ID The spec allows for up to NFS4_OPAQUE_LIMIT (1k). While we'll almost certainly never use that much, these ops are generally the only ones in the compound so we might as well allow for them to be that large. Also, the existing code didn't add in a word for the opaque length field for either name string. Fix that while we're in there. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:15:40 -04:00
Jeff Layton	3a6bb73879	nfs: convert setclientid and exchange_id encoders to use clp->cl_owner_id ...instead of buffers that are part of their arg structs. We already hold a reference to the client, so we might as well use the allocated buffer. In the event that we can't allocate the clp->cl_owner_id, then just return -ENOMEM. Note too that we switch from a GFP_KERNEL allocation here to GFP_NOFS. It's possible we could end up trying to do a SETCLIENTID or EXCHANGE_ID in order to reclaim some memory, and the GFP_KERNEL allocations in the existing code could cause recursion back into NFS reclaim. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:15:31 -04:00
Fabian Frederick	455b6ee645	pnfs/flexfiles: use swap() in ff_layout_sort_mirrors() Use kernel.h macro definition. Thanks to Julia Lawall for Coccinelle scripting support. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-16 11:14:03 -04:00
Al Viro	70d45cdb66	ufs: don't touch mtime/ctime of directory being moved See "ext2: Do not update mtime of a moved directory" (and followup in "ext2: fix unbalanced kmap()/kunmap()") for background; this is UFS equivalent - the same problem exists here. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-16 02:08:34 -04:00
Al Viro	a50e4a02ad	ufs: don't bother with lock_ufs()/unlock_ufs() for directory access We are already serialized by ->i_mutex and operations on different directories are independent. These calls are just rudiments of blind BKL conversion and they should've been removed back then. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-16 02:08:31 -04:00
Jan Kara	514d748f69	ufs: Fix possible deadlock when looking up directories Commit `e4502c63f5` (ufs: deal with nfsd/iget races) made ufs create inodes with I_NEW flag set. However ufs_mkdir() never cleared this flag. Thus if someone ever tried to lookup the directory by inode number, he would deadlock waiting for I_NEW to be cleared. Luckily this mostly happens only if the filesystem is exported over NFS since otherwise we have the inode attached to dentry and don't look it up by inode number. In rare cases dentry can get freed without inode being freed and then we'd hit the deadlock even without NFS export. Fix the problem by clearing I_NEW before instantiating new directory inode. Fixes: `e4502c63f5` Reported-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-16 02:08:12 -04:00
Jan Kara	12ecbb4b1d	ufs: Fix warning from unlock_new_inode() Commit `e4502c63f5` (ufs: deal with nfsd/iget races) introduced unlock_new_inode() call into ufs_add_nondir(). However that function gets called also from ufs_link() which hands it already initialized inode and thus unlock_new_inode() complains. The problem is harmless but annoying. Fix the problem by opencoding necessary stuff in ufs_link() Fixes: `e4502c63f5` Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-16 02:08:07 -04:00
Fabian Frederick	cdd9eefdf9	fs/ufs: restore s_lock mutex Commit `0244756edc` ("ufs: sb mutex merge + mutex_destroy") generated deadlocks in read/write mode on mkdir. This patch partially reverts it keeping fixes by Andrew Morton and mutex_destroy() [AV: fixed a missing bit in ufs_remount()] Signed-off-by: Fabian Frederick <fabf@skynet.be> Reported-by: Ian Campbell <ian.campbell@citrix.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-16 02:07:38 -04:00
Michal Hocko	7b506b1035	jbd2: get rid of open coded allocation retry loop insert_revoke_hash does an open coded endless allocation loop if journal_oom_retry is true. It doesn't implement any allocation fallback strategy between the retries, though. The memory allocator doesn't know about the never fail requirement so it cannot potentially help to move on with the allocation (e.g. use memory reserves). Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the debugging message but I am not sure it is anyhow helpful. Do the same for journal_alloc_journal_head which is doing a similar thing. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-15 15:45:58 -04:00
Andreas Dilger	b03a2f7eb2	ext4: improve warning directory handling messages Several ext4_warning() messages in the directory handling code do not report the inode number of the (potentially corrupt) directory where a problem is seen, and others report this in an ad-hoc manner. Add an ext4_warning_inode() helper to print the inode number and command name consistent with ext4_error_inode(). Consolidate the place in ext4.h that these macros are defined. Clean up some other directory error and warning messages to print the calling function name. Minor code style fixes in nearby lines. Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-15 14:50:26 -04:00
Joseph Qi	6f6a6fda29	jbd2: fix ocfs2 corrupt when updating journal superblock fails If updating journal superblock fails after journal data has been flushed, the error is omitted and this will mislead the caller as a normal case. In ocfs2, the checkpoint will be treated successfully and the other node can get the lock to update. Since the sb_start is still pointing to the old log block, it will rewrite the journal data during journal recovery by the other node. Thus the new updates will be overwritten and ocfs2 corrupts. So in above case we have to return the error, and ocfs2_commit_cache will take care of the error and prevent the other node to do update first. And only after recovering journal it can do the new updates. The issue discussion mail can be found at: https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html http://comments.gmane.org/gmane.comp.file-systems.ext4/48841 [ Fixed bug in patch which allowed a non-negative error return from jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this was causing xfstests ext4/306 to fail. -- Ted ] Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: stable@vger.kernel.org	2015-06-15 14:36:01 -04:00
Rasmus Villemoes	97b4af2f76	ext4: mballoc: avoid 20-argument function call Making a function call with 20 arguments is rather expensive in both stack and .text. In this case, doing the formatting manually doesn't make it any less readable, so we might as well save 155 bytes of .text and 112 bytes of stack. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>	2015-06-15 00:32:58 -04:00
Lukas Czerner	0d306dcf86	ext4: wait for existing dio workers in ext4_alloc_file_blocks() Currently existing dio workers can jump in and potentially increase extent tree depth while we're allocating blocks in ext4_alloc_file_blocks(). This may cause us to underestimate the number of credits needed for the transaction because the extent tree depth can change after our estimation. Fix this by waiting for all the existing dio workers in the same way as we do it in ext4_punch_hole. We've seen errors caused by this in xfstest generic/299, however it's really hard to reproduce. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-15 00:23:53 -04:00
Lukas Czerner	4134f5c88d	ext4: recalculate journal credits as inode depth changes Currently in ext4_alloc_file_blocks() the number of credits is calculated only once before we enter the allocation loop. However within the allocation loop the extent tree depth can change, hence the number of credits needed can increase potentially exceeding the number of credits reserved in the handle which can cause journal failures. Fix this by recalculating number of credits when the inode depth changes. Note that even though ext4_alloc_file_blocks() is only currently used by extent base inodes we will avoid recalculating number of credits unnecessarily in the case of indirect based inodes. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-15 00:20:46 -04:00
Dmitry Monakhov	b4f1afcd06	jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail() jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start() So allocations should be done with GFP_NOFS [Full stack trace snipped from 3.10-rh7] [<ffffffff815c4bd4>] dump_stack+0x19/0x1b [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80 [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20 [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17 [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210 [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20 [<ffffffff81147939>] mempool_alloc+0x69/0x170 [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20 [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150 [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0 [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120 [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2] [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2] [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2] [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30 [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210 [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2] [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10 [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4] [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270 [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390 [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0 [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0 [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4] [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0 [<ffffffff811b93f0>] do_sync_write+0x90/0xe0 [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0 [<ffffffff811ba5b8>] SyS_write+0x58/0xb0 [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2015-06-15 00:18:02 -04:00
Fabian Frederick	13b987ea27	fs/ufs: revert "ufs: fix deadlocks introduced by sb mutex merge" This reverts commit `9ef7db7f38` ("ufs: fix deadlocks introduced by sb mutex merge") That patch tried to solve commit `0244756edc` ("ufs: sb mutex merge + mutex_destroy") which is itself partially reverted due to multiple deadlocks. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2015-06-14 11:31:51 -04:00
Al Viro	3f4a949410	ncpfs: successful rename() should invalidate caches for parents Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-14 11:31:39 -04:00
Fabian Frederick	bf86546760	ext4: use swap() in mext_page_double_lock() Use kernel.h macro definition. Thanks to Julia Lawall for Coccinelle scripting support. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-12 23:47:33 -04:00
Fabian Frederick	4b7e2db5c0	ext4: use swap() in memswap() Use kernel.h macro definition. Thanks to Julia Lawall for Coccinelle scripting support. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-12 23:46:33 -04:00
Theodore Ts'o	bdf96838ae	ext4: fix race between truncate and __ext4_journalled_writepage() The commit cf108bca465d: "ext4: Invert the locking order of page_lock and transaction start" caused __ext4_journalled_writepage() to drop the page lock before the page was written back, as part of changing the locking order to jbd2_journal_start -> page_lock. However, this introduced a potential race if there was a truncate racing with the data=journalled writeback mode. Fix this by grabbing the page lock after starting the journal handle, and then checking to see if page had gotten truncated out from under us. This fixes a number of different warnings or BUG_ON's when running xfstests generic/086 in data=journalled mode, including: jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7 c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 - and - kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! ... Call Trace: [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c027d883>] ? lock_buffer+0x36/0x36 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 [<c0229139>] do_invalidatepage+0x22/0x26 [<c0229198>] truncate_inode_page+0x5b/0x85 [<c022934b>] truncate_inode_pages_range+0x156/0x38c [<c0229592>] truncate_inode_pages+0x11/0x15 [<c022962d>] truncate_pagecache+0x55/0x71 [<c02b913b>] ext4_setattr+0x4a9/0x560 [<c01ca542>] ? current_kernel_time+0x10/0x44 [<c026c4d8>] notify_change+0x1c7/0x2be [<c0256a00>] do_truncate+0x65/0x85 [<c0226f31>] ? file_ra_state_init+0x12/0x29 - and - WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 irty_metadata+0x14a/0x1ae() ... Call Trace: [<c01b879f>] ? console_unlock+0x3a1/0x3ce [<c082cbb4>] dump_stack+0x48/0x60 [<c0178b65>] warn_slowpath_common+0x89/0xa0 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae [<c0178bef>] warn_slowpath_null+0x14/0x18 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d [<c02b2f44>] write_end_fn+0x40/0x53 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a [<c02b59e7>] ext4_writepage+0x354/0x3b8 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b5a5b>] __writepage+0x10/0x2e [<c0225956>] write_cache_pages+0x22d/0x32c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b6ee8>] ext4_writepages+0x102/0x607 [<c019adfe>] ? sched_clock_local+0x10/0x10e [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 [<c01a8ad5>] ? lock_is_held+0x43/0x51 [<c0226dff>] do_writepages+0x1c/0x29 [<c0276bed>] __writeback_single_inode+0xc3/0x545 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d ... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2015-06-12 23:45:33 -04:00
Theodore Ts'o	1cb767cd4a	ext4 crypto: fail the mount if blocksize != pagesize We currently don't correctly handle the case where blocksize != pagesize, so disallow the mount in those cases. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-12 23:44:33 -04:00
Josef Bacik	37b8d27de5	Btrfs: use received_uuid of parent during send Neil Horman pointed out a problem where if he did something like this receive A snap A B change B send -p A B and then on another box do recieve A receive B the receive B would fail because we use the UUID of A for the clone sources for B. This makes sense most of the time because normally you are sending from the original sources, not a received source. However when you use a recieved subvol its UUID is going to be something completely different, so if you then try to receive the diff on a different volume it won't find the UUID because the new A will be something else. The only constant is the received uuid. So instead check to see if we have received_uuid set on the root, and if so use that as the clone source, as btrfs receive looks for matches either in received_uuid or uuid. Thanks, Reported-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-12 13:20:38 -07:00
Liu Bo	0eeff2362b	Btrfs: fix use-after-free in btrfs_replay_log @log_root_tree should not be referenced after kfree. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-12 11:03:21 -07:00
Chao Yu	3c45414527	f2fs: do not trim preallocated blocks when truncating after i_size When we perform generic/092 in xfstests, output is like below: XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) 0: [0..10239]: data 0: [0..10239]: data -1: [10240..20479]: unwritten +1: [10240..14335]: unwritten This is because with this testcase, we redefine the regulation for truncate in perallocated space past i_size as below: "There was some confused about what the fs was supposed to do when you truncate at i_size with preallocated space past i_size. We decided on the following things. 1) truncate(i_size) will trim all blocks past i_size. 2) truncate(x) where x > i_size will not trim all blocks past i_size. " This method is used in xfs, and then ext4/btrfs will follow the rule. This patch fixes to follow the new rule for f2fs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-11 18:30:49 -07:00
Trond Myklebust	4e54ab8d8c	NFS: Ensure that we update the sequence id under the slot table lock Fix a callback slot table regression. Fixes: `e937ee714b` ("nfs: Only update callback sequnce id when CB_SEQUENCE success") Cc: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-11 21:15:52 -04:00
Kinglong Mee	0579c8d208	nfs: Initialize cb_sequenceres information before validate_seqid() For a cb_layoutrecall replay, nfsd got CB_SEQUENCE status of zero, but all informations of cb_sequenceres are zero too !!! validate_seqid() return NFS4ERR_RETRY_UNCACHED_REP for a replay, and skip the initlize cb_sequenceres. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-11 21:09:06 -04:00
Jaegeuk Kim	43f54cd52f	f2fs crypto: add alloc_bounce_page This patch adds alloc_bounce_page likewise ext4. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-11 15:04:20 -07:00
Jaegeuk Kim	7e8e754a4b	f2fs crypto: fix to handle errors likewise ext4 This patch makes some error handling policies same with ext4. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-11 15:00:36 -07:00
Jeff Layton	6f02dc88be	nfs: deny backchannel RPCs with an incorrect authflavor instead of dropping them A drop should really only be done when the frame is malformed or we have reason to think that there is some sort of DoS going on. When we get an RPC with bad auth, we should send back an error instead. Cc: Andy Adamson <William.Adamson@netapp.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-11 14:06:34 -04:00
Kinglong Mee	e937ee714b	nfs: Only update callback sequnce id when CB_SEQUENCE success When testing pnfs layout, nfsd got error NFS4ERR_SEQ_MISORDERED. It is caused by nfs return NFS4ERR_DELAY before validate_seqid(), don't update the sequnce id, but nfsd updates the sequnce id !!! According to RFC5661 20.9.3, " If CB_SEQUENCE returns an error, then the state of the slot (sequence ID, cached reply) MUST NOT change. " Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-11 14:00:40 -04:00
Vaishali Thakkar	4ed0d83d05	NFS: Convert use of __constant_htonl to htonl In little endian cases, the macro htonl unfolds to __swab32 which provides special case for constants. In big endian cases, __constant_htonl and htonl expand directly to the same expression. So, replace __constant_htonl with htonl with the goal of getting rid of the definition of __constant_htonl completely. The semantic patch that performs this transformation is as follows: @@expression x;@@ - __constant_htonl(x) + htonl(x) Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:57:59 -04:00
Anna Schumaker	11598b8ff2	NFS: Remove unused nfs_rw_ops->rw_release() function This was only ever set to nfs_writeback_release_common(), a function which is completely empty. Let's just drop this function pointer and simplify the code a bit. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:32:40 -04:00
Dominique Martinet	c86c90c656	NFSv4: handle nfs4_get_referral failure nfs4_proc_lookup_common is supposed to return a posix error, we have to handle any error returned that isn't errno Reported-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:28:02 -04:00
Jeff Layton	3c87ef6efb	sunrpc: keep a count of swapfiles associated with the rpc_clnt Jerome reported seeing a warning pop when working with a swapfile on NFS. The nfs_swap_activate can end up calling sk_set_memalloc while holding the rcu_read_lock and that function can sleep. To fix that, we need to take a reference to the xprt while holding the rcu_read_lock, set the socket up for swapping and then drop that reference. But, xprt_put is not exported and having NFS deal with the underlying xprt is a bit of layering violation anyway. Fix this by adding a set of activate/deactivate functions that take a rpc_clnt pointer instead of an rpc_xprt, and have nfs_swap_activate and nfs_swap_deactivate call those. Also, add a per-rpc_clnt atomic counter to keep track of the number of active swapfiles associated with it. When the counter does a 0->1 transition, we enable swapping on the xprt, when we do a 1->0 transition we disable swapping on it. This also allows us to be a bit more selective with the RPC_TASK_SWAPPER flag. If non-swapper and swapper clnts are sharing a xprt, then we only need to flag the tasks from the swapper clnt with that flag. Acked-by: Mel Gorman <mgorman@suse.de> Reported-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-10 18:26:14 -04:00
Zhao Lei	9a4e7276d3	btrfs: wait for delayed iputs on no space btrfs will report no_space when we run following write and delete file loop: # FILE_SIZE_M=[ 75% of fs space ] # DEV=[ some dev ] # MNT=[ some dir ] # # mkfs.btrfs -f "$DEV" # mount -o nodatacow "$DEV" "$MNT" # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done # Reason: iput() and evict() is run after write pages to block device, if write pages work is not finished before next write, the "rm"ed space is not freed, and caused above bug. Fix: We can add "-o flushoncommit" mount option to avoid above bug, but it have performance problem. Actually, we can to wait for on-the-fly writes only when no-space happened, it is which this patch do. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:34 -07:00
Qu Wenruo	d672633545	btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup. Make snapshot accounting work with new extent-oriented mechanism by skipping given root in new/old_roots in create_pending_snapshot(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:29 -07:00
Qu Wenruo	9086db86e0	btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. This is used by later qgroup fix patches for snapshot. As current snapshot accounting is done by btrfs_qgroup_inherit(), but new extent oriented quota mechanism will account extent from btrfs_copy_root() and other snapshot things, causing wrong result. So add this ability to handle snapshot accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:23 -07:00
Qu Wenruo	d4b8040459	btrfs: ulist: Add ulist_del() function. This function will delete unode with given (val,aux) pair. And with this patch, seqnum for debug usage doesn't have any meaning now, so remove them. This is used by later patches to skip snapshot root. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:17 -07:00
Qu Wenruo	e69bcee376	btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. Goodbye, the old mechanisim. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:11 -07:00
Qu Wenruo	442244c963	btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. Since the self test transaction don't have delayed_ref_roots, so use find_all_roots() and export btrfs_qgroup_account_extent() to simulate it Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:26:05 -07:00
Qu Wenruo	0ed4792af0	btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. Switch from old ref_node based qgroup to extent based qgroup mechanism for normal operations. The new mechanism should hugely reduce the overhead of btrfs quota system, and further more, the codes and logic should be more clean and easier to maintain. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:59 -07:00
Qu Wenruo	9d220c95f5	btrfs: qgroup: Switch rescan to new mechanism. Switch rescan to use the new new extent oriented mechanism. As rescan is also based on extent, new mechanism is just a perfect match for rescan. With re-designed internal functions, rescan is quite easy, just call btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:54 -07:00
Qu Wenruo	550d7a2ed5	btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents(). The new btrfs_qgroup_account_extents() function should be called in btrfs_commit_transaction() and it will update all the qgroup according to delayed_ref_root->dirty_extent_root. The new function can handle both normal operation during commit_transaction() or in rescan in a unified method with clearer logic. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:49 -07:00
Qu Wenruo	21633fc603	btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots(). Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree lock to do tree search. This is important for later qgroup implement which will call find_all_roots() after fs trees are committed. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:45 -07:00
Qu Wenruo	3b7d00f99c	btrfs: qgroup: Add new function to record old_roots. Add function btrfs_qgroup_prepare_account_extents() to get old_roots which are needed for qgroup. We do it in commit_transaction() and before switch_roots(), and only search commit_root, so it gives a quite accurate view for previous transaction. With old_roots from previous transaction, we can use it to do accurate account with current transaction. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:39 -07:00
Qu Wenruo	3368d001ba	btrfs: qgroup: Record possible quota-related extent for qgroup. Add hook in add_delayed_ref_head() to record quota-related extent record into delayed_ref_root->dirty_extent_record rb-tree for later qgroup accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:32 -07:00
Qu Wenruo	823ae5b8e3	btrfs: qgroup: Add function qgroup_update_counters(). Add function qgroup_update_counters(), which will update related qgroups' rfer/excl according to old/new_roots. This is one of the two core functions for the new qgroup implement. This is based on btrfs_adjust_coutners() but with clearer logic and comment. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:28 -07:00
Qu Wenruo	d810ef2be5	btrfs: qgroup: Add function qgroup_update_refcnt(). This function is used to update refcnt for qgroups. And is one of the two core functions used in the new qgroup implement. This is based on the old update_old/new_refcnt, but provides a unified logic and behavior. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:24 -07:00
Qu Wenruo	c682f9b3c2	btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent() __btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too many parameters, but three of them can be extracted from btrfs_delayed_ref_node struct. So use btrfs_delayed_ref_node struct as a single parameter to replace the bytenr/num_byte/no_quota parameters. The real objective of this patch is to allow btrfs_qgroup_record_ref() get the delayed_ref_node in incoming qgroup patches. Other functions calling btrfs_qgroup_record_ref() are not affected since the rest will only add/sub exclusive extents, where node is not used. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:18 -07:00
Qu Wenruo	9c542136fd	btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read. Use inline functions to do such things, to improve readability. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Acked-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:13 -07:00
Qu Wenruo	c43d160fcd	btrfs: delayed-ref: Cleanup the unneeded functions. Cleanup the rb_tree merge/insert/update functions, since now we use list instead of rb_tree now. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:09 -07:00
Qu Wenruo	c6fc245499	btrfs: delayed-ref: Use list to replace the ref_root in ref_head. This patch replace the rbtree used in ref_head to list. This has the following advantage: 1) Easier merge logic. With the new list implement, we only need to care merging the tail ref_node with the new ref_node. And this can be done quite easy at insert time, no need to do a indicated merge at run_delayed_refs(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:25:03 -07:00
Qu Wenruo	00db646d3f	btrfs: backref: Don't merge refs which are not for same block. Old __merge_refs() in backref.c will even merge refs whose root_id are different, which makes qgroup gives wrong result. Fix it by checking ref_for_same_block() before any mode specific works. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 09:24:59 -07:00
Zhao Lei	20b2e3029e	btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() lockdep report following warning in test: [25176.843958] ================================= [25176.844519] [ INFO: inconsistent lock state ] [25176.845047] 4.1.0-rc3 #22 Tainted: G W [25176.845591] --------------------------------- [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes: [25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs] [25176.847838] {SOFTIRQ-ON-W} state was registered at: [25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10 [25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0 [25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410 [25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs] [25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs] [25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs] [25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs] [25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs] [25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs] [25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570 [25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80 [25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f [25176.854935] irq event stamp: 51506 [25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0 [25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120 [25176.857746] other info that might help us debug this: [25176.858845] Possible unsafe locking scenario: [25176.859981] CPU0 [25176.860537] ---- [25176.861059] lock(&wr_ctx->wr_lock); [25176.861705] <Interrupt> [25176.862272] lock(&wr_ctx->wr_lock); [25176.862881] * DEADLOCK * Reason: Above warning is caused by: Interrupt -> bio_endio() -> ... -> scrub_put_ctx() -> scrub_free_ctx() 1 -> ... -> mutex_lock(&wr_ctx->wr_lock); scrub_put_ctx() is allowed to be called in end_bio interrupt, but in code design, it will never call scrub_free_ctx(sctx) in interrupe context(above 1), because btrfs_scrub_dev() get one additional reference of sctx->refs, which makes scrub_free_ctx() only called withine btrfs_scrub_dev(). Now the code runs out of our wish, because free sequence in scrub_pending_bio_dec() have a gap. Current code: -----------------------------------+----------------------------------- scrub_pending_bio_dec() \| btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); \| wake_up(&sctx->list_wait); \| \| scrub_put_ctx() \| -> atomic_dec_and_test(&sctx->refs) scrub_put_ctx(sctx); \| -> atomic_dec_and_test(&sctx->refs)\| -> scrub_free_ctx() \| -----------------------------------+----------------------------------- We expected: -----------------------------------+----------------------------------- scrub_pending_bio_dec() \| btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); \| wake_up(&sctx->list_wait); \| scrub_put_ctx(sctx); \| -> atomic_dec_and_test(&sctx->refs)\| \| scrub_put_ctx() \| -> atomic_dec_and_test(&sctx->refs) \| -> scrub_free_ctx() -----------------------------------+----------------------------------- Fix: Move scrub_pending_bio_dec() to a workqueue, to avoid this function run in interrupt context. Tested by check tracelog in debug. Changelog v1->v2: Use workqueue instead of adjust function call sequence in v1, because v1 will introduce a bug pointed out by: Filipe David Manana <fdmanana@gmail.com> Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:04:52 -07:00
Mark Fasheh	e1d227a42e	btrfs: Handle unaligned length in extent_same The extent-same code rejects requests with an unaligned length. This poses a problem when we want to dedupe the tail extent of files as we skip cloning the portion between i_size and the extent boundary. If we don't clone the entire extent, it won't be deleted. So the combination of these behaviors winds up giving us worst-case dedupe on many files. We can fix this by allowing a length that extents to i_size and internally aligining those to the end of the block. This is what btrfs_ioctl_clone() so we can just copy that check over. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:50 -07:00
chandan	070034bdf9	Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag. max_to_defrag represents the number of pages to defrag rather than the last page of the file range to be defragged. Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag should actually have the value 4. Instead in the current code we end up setting it to 6. Now, this does not (yet) cause an issue since the first part of the while loop condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control to flow out of the while loop before any buggy behavior is actually caused. So the patch just makes sure that max_to_defrag ends up having the right value rather than fixing a bug. I did run the xfstests suite to make sure that the code does not regress. Changelog: v1->v2: Provide a much descriptive commit message. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:48 -07:00
chandan	e4826a5b24	Btrfs: btrfs_defrag_file: Fix ra_index computation. Read-ahead is done for the pages in the range [ra_index, ra_index + cluster - 1]. So the next read-ahead should be starting from the page at index 'ra_index + cluster' (unless we deemed that the extent at 'ra_index + cluster' as non-defraggable) rather than from the page at index 'ra_index + max_cluster'. This patch fixes this. I did run the xfstests suite to make sure that the code does not regress. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:47 -07:00
Filipe Manana	4617ea3a52	Btrfs: fix necessary chunk tree space calculation when allocating a chunk When allocating a new chunk or removing one we need to update num_devs device items and insert or remove a chunk item in the chunk tree, so in the worst case the space needed in the chunk space_info is: btrfs_calc_trunc_metadata_size(chunk_root, num_devs) + btrfs_calc_trans_metadata_size(chunk_root, 1) That is, in the worst case we need to cow num_devs paths and cow 1 other path that can result in splitting every node and leaf, and each path consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes, which were unnecessary since updating the existing device items does not result in splitting the nodes and leaf since after updating them they remain with the same size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:46 -07:00
Filipe Manana	7558c8bc17	Btrfs: don't attach unnecessary extents to transaction on fsync We don't need to attach ordered extents that have completed to the current transaction. Doing so only makes us hold memory for longer than necessary and delaying the iput of the inode until the transaction is committed (for each created ordered extent we do an igrab and then schedule an asynchronous iput when the ordered extent's reference count drops to 0), preventing the inode from being evictable until the transaction commits. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:44 -07:00
Filipe Manana	b659ef0277	Btrfs: avoid syncing log in the fast fsync path when not necessary Commit `3a8b36f378` ("Btrfs: fix data loss in the fast fsync path") added a performance regression for that causes an unnecessary sync of the log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done against a file, without no writes or any metadata updates to the inode in between them and if a transaction is committed before the second fsync is called. Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99) after a test sysbench test that measured a -62% decrease of file io requests per second for that tests' workload. The test is: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor mkfs -t btrfs /dev/sda2 mount -t btrfs /dev/sda2 /fs/sda2 cd /fs/sda2 for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \ --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \ --file-num=1024 run A test on kvm guest, running a debug kernel gave me the following results: Without 3a8b36f378060d: 16.01 reqs/sec With 3a8b36f378060d: 3.39 reqs/sec With `3a8b36f378` and this patch: 16.04 reqs/sec Reported-by: Huang Ying <ying.huang@intel.com> Tested-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-10 07:02:43 -07:00
Chris Mason	1ab818b137	Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2	2015-06-10 07:02:41 -07:00
Jaegeuk Kim	de6a8ec982	f2fs: drop the volatile_write flag only When aborting volatile_writes, let's drop its flag and give up any further volatile_writes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-09 13:56:47 -07:00
Abhi Das	86066914ed	gfs2: Don't support fallocate on jdata files We cannot provide an efficient implementation due to the headers on the data blocks, so there doesn't seem much point in having it. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-06-09 09:16:46 -05:00
Namjae Jeon	331573febb	ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4. 1) Make sure that both offset and len are block size aligned. 2) Update the i_size of inode by len bytes. 3) Compute the file's logical block number against offset. If the computed block number is not the starting block of the extent, split the extent such that the block number is the starting block of the extent. 4) Shift all the extents which are lying between [offset, last allocated extent] towards right by len bytes. This step will make a hole of len bytes at offset. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>	2015-06-09 01:55:03 -04:00
David S. Miller	941742f497	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2015-06-08 20:06:56 -07:00
Chao Yu	c5bda1c8b1	f2fs: skip committing valid superblock In recovery procedure for superblock, we try to write data of valid superblock into invalid one for recovery, work should be finished here, but then still we will write the valid one with its original data. This operation is not needed. Let's skip doing this unnecessary work. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-08 11:20:51 -07:00
Chao Yu	09d54cdd20	f2fs: setting discard option in parse_options() For the first mount of f2fs image with realtime discard option, we will disable discard option if device is not supported, but for remount operation, our discard option can still be set, this should be avoided. This patch moves configuring of discard option to parse_options() to fix this issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-08 11:09:19 -07:00
Greg Kroah-Hartman	987aec39a7	Merge 4.1-rc7 into driver-core-next We want the fixes in this branch as well for testing and merge resolution. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-06-08 10:19:40 -07:00
Jan Kara	de92c8caf1	jbd2: speedup jbd2_journal_get_[write\|undo]_access() jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are frequently called for buffers that are already part of the running transaction - most frequently it is the case for bitmaps, inode table blocks, and superblock. Since in such cases we have nothing to do, it is unfortunate we still grab reference to journal head, lock the bh, lock bh_state only to find out there's nothing to do. Improving this is a bit subtle though since until we find out journal head is attached to the running transaction, it can disappear from under us because checkpointing / commit decided it's no longer needed. We deal with this by protecting journal_head slab with RCU. We still have to be careful about journal head being freed & reallocated within slab and about exposing journal head in consistent state (in particular b_modified and b_frozen_data must be in correct state before we allow user to touch the buffer). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 12:46:37 -04:00
Jan Kara	8b00f400ee	jbd2: more simplifications in do_get_write_access() Check for the simple case of unjournaled buffer first, handle it and bail out. This allows us to remove one if and unindent the difficult case by one tab. The result is easier to read. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 12:44:21 -04:00
Jan Kara	d012aa5965	jbd2: simplify error path on allocation failure in do_get_write_access() We were acquiring bh_state_lock when allocation of buffer failed in do_get_write_access() only to be able to jump to a label that releases the lock and does all other checks that don't make sense for this error path. Just jump into the right label instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 12:40:39 -04:00
Jan Kara	ee57aba159	jbd2: simplify code flow in do_get_write_access() needs_copy is set only in one place in do_get_write_access(), just move the frozen buffer copying into that place and factor it out to a separate function to make do_get_write_access() slightly more readable. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 12:39:07 -04:00
Fabian Frederick	b4ab9e2982	ext4 crypto: fix sparse warnings in fs/ext4/ioctl.c [ Added another sparse fix for EXT4_IOC_GET_ENCRYPTION_POLICY while we're at it. --tytso ] Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 12:23:21 -04:00
Abhi Das	1bdf45352e	gfs2: s64 cast for negative quota value One-line fix to cast quota value to s64 before comparison. By default the quantity is treated as u64. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2015-06-08 11:20:50 -05:00
David Moore	8bc3b1e6e8	ext4: BUG_ON assertion repeated for inode1, not done for inode2 During a source code review of fs/ext4/extents.c I noted identical consecutive lines. An assertion is repeated for inode1 and never done for inode2. This is not in keeping with the rest of the code in the ext4_swap_extents function and appears to be a bug. Assert that the inode2 mutex is not locked. Signed-off-by: David Moore <dmoorefo@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2015-06-08 11:59:12 -04:00
Theodore Ts'o	ad0a0ce894	ext4 crypto: fix ext4_get_crypto_ctx()'s calling convention in ext4_decrypt_one Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 11:54:56 -04:00
Lukas Czerner	42ac1848ea	ext4: return error code from ext4_mb_good_group() Currently ext4_mb_good_group() only returns 0 or 1 depending on whether the allocation group is suitable for use or not. However we might get various errors and fail while initializing new group including -EIO which would never get propagated up the call chain. This might lead to an endless loop at writeback when we're trying to find a good group to allocate from and we fail to initialize new group (read error for example). Fix this by returning proper error code from ext4_mb_good_group() and using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator() we will always return only the first occurred error from ext4_mb_good_group() and we only propagate it back to the caller if we do not get any other errors and we fail to allocate any blocks. Note that with other modes than errors=continue, we will fail immediately in ext4_mb_good_group() in case of error, however with errors=continue we should try to continue using the file system, that's why we're not going to fail immediately when we see an error from ext4_mb_good_group(), but rather when we fail to find a suitable block group to allocate from due to an problem in group initialization. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2015-06-08 11:40:40 -04:00
Lukas Czerner	bbdc322f2c	ext4: try to initialize all groups we can in case of failure on ppc64 Currently on the machines with page size > block size when initializing block group buddy cache we initialize it for all the block group bitmaps in the page. However in the case of read error, checksum error, or if a single bitmap is in any way corrupted we would fail to initialize all of the bitmaps. This is problematic because we will not have access to the other allocation groups even though those might be perfectly fine and usable. Fix this by reading all the bitmaps instead of error out on the first problem and simply skip the bitmaps which were either not read properly, or are not valid. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 11:38:37 -04:00
Lukas Czerner	41e5b7ed3e	ext4: verify block bitmap even after fresh initialization If we want to rely on the buffer_verified() flag of the block bitmap buffer, we have to set it consistently. However currently if we're initializing uninitialized block bitmap in ext4_read_block_bitmap_nowait() we're not going to set buffer verified at all. We can do this by simply setting the flag on the buffer, but I think it's actually better to run ext4_validate_block_bitmap() to make sure that what we did in the ext4_init_block_bitmap() is right. So run ext4_validate_block_bitmap() even after the block bitmap initialization. Also bail out early from ext4_validate_block_bitmap() if we see corrupt bitmap, since we already know it's corrupt and we do not need to verify that. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-08 11:18:52 -04:00
Tejun Heo	412a19b64a	v9fs: fix error handling in v9fs_session_init() On failure, v9fs_session_init() returns with the v9fs_session_info struct partially initialized and expects the caller to invoke v9fs_session_close() to clean it up; however, it doesn't track whether the bdi is initialized or not and curiously invokes bdi_destroy() in both vfs_session_init() failure path too. A. If v9fs_session_init() fails before the bdi is initialized, the follow-up v9fs_session_close() will invoke bdi_destroy() on an uninitialized bdi. B. If v9fs_session_init() fails after the bdi is initialized, bdi_destroy() will be called twice on the same bdi - once in the failure path of v9fs_session_init() and then by v9fs_session_close(). A is broken no matter what. B used to be okay because bdi_destroy() allowed being invoked multiple times on the same bdi, which BTW was broken in its own way - if bdi_destroy() was invoked on an initialiezd but !registered bdi, it'd fail to free percpu counters. Since `f0054bb1e1` ("writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback"), this no longer work - bdi_destroy() on an initialized but not registered bdi works correctly but multiple invocations of bdi_destroy() is no longer allowed. The obvious culprit here is v9fs_session_init()'s odd and broken error behavior. It should simply clean up after itself on failures. This patch makes the following updates to v9fs_session_init(). * @rc -> @retval error return propagation removed. It didn't serve any purpose. Just use @rc. * Move addition to v9fs_sessionlist to the end of the function so that incomplete sessions are not put on the list or iterated and error path doesn't have to worry about it. * Update error handling so that it cleans up after itself. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-08 09:05:06 -06:00
Michal Hocko	6ccaf3e2f3	jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL This basically reverts `47def82672` (jbd2: Remove __GFP_NOFAIL from jbd2 layer). The deprecation of __GFP_NOFAIL was a bad choice because it led to open coding the endless loop around the allocator rather than removing the dependency on the non failing allocation. So the deprecation was a clear failure and the reality tells us that __GFP_NOFAIL is not even close to go away. It is still true that __GFP_NOFAIL allocations are generally discouraged and new uses should be evaluated and an alternative (pre-allocations or reservations) should be considered but it doesn't make any sense to lie the allocator about the requirements. Allocator can take steps to help making a progress if it knows the requirements. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Acked-by: David Rientjes <rientjes@google.com>	2015-06-08 10:53:10 -04:00
Kinglong Mee	276f03e3ba	nfsd: Update callback sequnce id only CB_SEQUENCE success When testing pnfs layout, nfsd got error NFS4ERR_SEQ_MISORDERED. It is caused by nfs return NFS4ERR_DELAY before validate_seqid(), don't update the sequnce id, but nfsd updates the sequnce id !!! According to RFC5661 20.9.3, " If CB_SEQUENCE returns an error, then the state of the slot (sequence ID, cached reply) MUST NOT change. " Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:51:30 -04:00
Kinglong Mee	4399396eec	nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying nfsd enters a infinite loop and prints message every 10 seconds: May 31 18:33:52 test-server kernel: Error sending entire callback! May 31 18:34:01 test-server kernel: Error sending entire callback! This is caused by a cb_layoutreturn getting error -10008 (NFS4ERR_DELAY), the client crashing, and then nfsd entering the infinite loop: bc_sendto --> call_timeout --> nfsd4_cb_done --> nfsd4_cb_layout_done with error -10008 --> rpc_delay(task, HZ/100) --> bc_sendto ... Reproduced using xfstests 074 with nfs client's kdump on, CONFIG_DEFAULT_HUNG_TASK_TIMEOUT set, and client's blkmapd down: 1. nfs client's write operation will get the layout of file, and then send getdeviceinfo, 2. but layout segment is not recorded by client because blkmapd is down, 3. client writes data by sending WRITE to server, 4. nfs server recalls the layout of the file before WRITE, 5. network error causes the client reset the session and return NFS4ERR_DELAY, 6. so client's WRITE operation is waiting the reply. If the task hangs 120s, the client will crash. 7. so that, the next bc_sendto will fail with TIMEOUT, and cb_status is NFS4ERR_DELAY. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-06-04 16:43:39 -04:00
Trond Myklebust	8eee52af27	NFSv4: nfs4_handle_delegation_recall_error should ignore EAGAIN EAGAIN is a valid return code from nfs4_open_recover(), and should be handled by nfs4_handle_delegation_recall_error by simply passing it through. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-04 13:51:13 -04:00
Eric W. Biederman	8c6cf9cc82	mnt: Modify fs_fully_visible to deal with locked ro nodev and atime Ignore an existing mount if the locked readonly, nodev or atime attributes are less permissive than the desired attributes of the new mount. On success ensure the new mount locks all of the same readonly, nodev and atime attributes as the old mount. The nosuid and noexec attributes are not checked here as this change is destined for stable and enforcing those attributes causes a regression in lxc and libvirt-lxc where those applications will not start and there are no known executables on sysfs or proc and no known way to create exectuables without code modifications Cc: stable@vger.kernel.org Fixes: `e51db73532` ("userns: Better restrictions on when proc and sysfs can be mounted") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2015-06-04 10:29:25 -05:00
Dave Chinner	4ea7976616	Merge branch 'xfs-commit-cleanup' into for-next Conflicts: fs/xfs/xfs_attr_inactive.c	2015-06-04 13:55:48 +10:00
Christoph Hellwig	f78c390107	xfs: fix xfs_log_done interface Instead of the confusing flags argument pass a boolean flag to indicate if we want to release or regrant a log reservation. Also ensure that xfs_log_done always drop the reference on the log ticket, to both simplify the code and make the logic in xfs_trans_roll easier to understand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:48:20 +10:00
Christoph Hellwig	70393313dd	xfs: saner xfs_trans_commit interface The flags argument to xfs_trans_commit is not useful for most callers, as a commit of a transaction without a permanent log reservation must pass 0 here, and all callers for a transaction with a permanent log reservation except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove the flags argument from the public xfs_trans_commit interfaces, and introduce low-level __xfs_trans_commit variant just for xfs_trans_roll that regrants a log reservation instead of releasing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:48:08 +10:00
Christoph Hellwig	4906e21545	xfs: remove the flags argument to xfs_trans_cancel xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and XFS_TRANS_ABORT. Both of them are a direct product of the transaction state, and can be deducted: - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled, and XFS_TRANS_ABORT is a noop for a transaction that is not dirty. - any transaction with a permanent log reservation needs XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent log reservation is invalid. So just remove the flags argument and do the right thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:47:56 +10:00
Christoph Hellwig	eacb24e734	xfs: pass a boolean flag to xfs_trans_free_items The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool parameter to allow further cleanups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:47:43 +10:00
Christoph Hellwig	2e6db6c4c1	xfs: switch remaining xfs_trans_dup users to xfs_trans_roll We have three remaining callers of xfs_trans_dup: - xfs_itruncate_extents which open codes xfs_trans_roll - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves attaching them to it's callers, but otherwise is identical to xfs_trans_roll - xfs_dir_ialloc looks at the log reservations in the old xfs_trans structure instead of the log reservation parameters, but otherwise is identical to xfs_trans_roll. By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch these three remaining users over to xfs_trans_roll and mark xfs_trans_dup static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:47:29 +10:00
Dave Chinner	4497f28750	Merge branch 'xfs-misc-fixes-for-4.2-2' into for-next	2015-06-04 13:31:13 +10:00
Brian Foster	46fc58dacf	xfs: check min blks for random debug mode sparse allocations The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:03:34 +10:00
Brian Foster	3cdaa1898f	xfs: fix sparse inodes 32-bit compile failure The kbuild test robot reports the following compilation failure with a 32-bit kernel configuration: fs/built-in.o: In function `xfs_ifree_cluster': >> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3' This is due to the use of the modulus operator on a 64-bit variable in the ASSERT() added as part of the following commit: xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() This ASSERT() simply checks that the offset of the inode in a sparse cluster is appropriately aligned. Since the maximum inode record offset is 63 (for a 64 inode record) and the calculated offset here should be something less than that, just use a 32-bit variable to store the offset and call the do_mod() helper. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 13:03:34 +10:00
Dave Chinner	66e8ac7bfa	Merge branch 'xfs-dax-support' into for-next	2015-06-04 13:01:49 +10:00
Dave Chinner	cbe4dab119	xfs: add initial DAX support Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:18 +10:00
Dave Chinner	6e1ba0bcb8	xfs: add DAX IO path support DAX does not do buffered IO (can't buffer direct access!) and hence all read/write IO is vectored through the direct IO path. Hence we need to add the DAX IO path callouts to the direct IO infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:15 +10:00
Dave Chinner	9969441f9f	xfs: add DAX truncate support When we truncate a DAX file, we need to call through the DAX page truncation path rather than through block_truncate_page() so that mappings and block zeroing are all handled correctly. Otherwise, truncate does not need to change. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:10 +10:00
Dave Chinner	4f69f578a8	xfs: add DAX block zeroing support Add initial support for DAX block zeroing operations to XFS. DAX cannot use buffered IO through the page cache for zeroing, nor do we need to issue IO for uncached block zeroing. In both cases, we can simply call out to the dax block zeroing function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:19:08 +10:00
Dave Chinner	6b698edeee	xfs: add DAX file operations support Add the initial support for DAX file operations to XFS. This includes the necessary block allocation and mmap page fault hooks for DAX to function. Note that there are changes to the splice interfaces to ensure that for DAX splice avoids direct page cache manipulations and instead takes the DAX IO paths for read/write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:53 +10:00
Dave Chinner	ce5c5d554d	dax: expose __dax_fault for filesystems with locking constraints Some filesystems cannot call dax_fault() directly because they have different locking and/or allocation constraints in the page fault IO path. To handle this, we need to follow the same model as the generic block_page_mkwrite code, where the internals are exposed via __block_page_mkwrite() so that filesystems can wrap the correct locking and operations around the outside. This is loosely based on a patch originally from Matthew Willcox. Unlike the original patch, it does not change ext4 code, error returns or unwritten extent conversion handling. It also adds a __dax_mkwrite() wrapper for .page_mkwrite implementations to do the right thing, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Dave Chinner	e842f29039	dax: don't abuse get_block mapping for endio callbacks dax_fault() currently relies on the get_block callback to attach an io completion callback to the mapping buffer head so that it can run unwritten extent conversion after zeroing allocated blocks. Instead of this hack, pass the conversion callback directly into dax_fault() similar to the get_block callback. When the filesystem allocates unwritten extents, it will set the buffer_unwritten() flag, and hence the dax_fault code can call the completion function in the contexts where it is necessary without overloading the mapping buffer head. Note: The changes to ext4 to use this interface are suspect at best. In fact, the way ext4 did this end_io assignment in the first place looks suspect because it only set a completion callback when there wasn't already some other write() call taking place on the same inode. The ext4 end_io code looks rather intricate and fragile with all it's reference counting and passing to different contexts for modification via inode private pointers that aren't protected by locks... Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Dave Chinner	ec56b1f1fd	xfs: mmap lock needs to be inside freeze protection Lock ordering for the new mmap lock needs to be: mmap_sem sb_start_pagefault i_mmap_lock page lock <fault processsing> Right now xfs_vm_page_mkwrite gets this the wrong way around, While technically it cannot deadlock due to the current freeze ordering, it's still a landmine that might explode if we change anything in future. Hence we need to nest the locks correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-04 09:18:18 +10:00
Theodore Ts'o	3dbb5eb9a3	ext4 crypto: allocate bounce pages using GFP_NOWAIT Previously we allocated bounce pages using a combination of alloc_page() and mempool_alloc() with the __GFP_WAIT bit set. Instead, use mempool_alloc() with GFP_NOWAIT. The mempool_alloc() function will try using alloc_pages() initially, and then only use the mempool reserve of pages if alloc_pages() is unable to fulfill the request. This minimizes the the impact on the mm layer when we need to do a large amount of writeback of encrypted files, as Jaeguk Kim had reported that under a heavy fio workload on a system with restricted amounts memory (which unfortunately, includes many mobile handsets), he had observed the the OOM killer getting triggered several times. Using GFP_NOWAIT If the mempool_alloc() function fails, we will retry the page writeback at a later time; the function of the mempool is to ensure that we can writeback at least 32 pages at a time, so we can more efficiently dispatch I/O under high memory pressure situations. In the future we should make this be a tunable so we can determine the best tradeoff between permanently sequestering memory and the ability to quickly launder pages so we can free up memory quickly when necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-06-03 09:32:39 -04:00
Filipe Manana	6ca0709756	Btrfs: fix hang during inode eviction due to concurrent readahead Zygo Blaxell and other users have reported occasional hangs while an inode is being evicted, leading to traces like the following: [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000 [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001 [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78 [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123 [ 5281.981396] Call Trace: [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83 [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs] [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs] [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs] [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs] [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148 [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5 [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [ 5282.001299] 1 lock held by rm/20488: [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b This happens when we have readahead, which calls readpages(), happening right before the inode eviction handler is invoked. So the reason is essentially: 1) readpages() is called while a reference on the inode is held, so eviction can not be triggered before readpages() returns. It also locks one or more ranges in the inode's io_tree (which is done at extent_io.c:__do_contiguous_readpages()); 2) readpages() submits several read bios, all with an end io callback that runs extent_io.c:end_bio_extent_readpage() and that is executed by other task when a bio finishes, corresponding to a work queue (fs_info->end_io_workers) worker kthread. This callback unlocks the ranges in the inode's io_tree that were previously locked in step 1; 3) readpages() returns, the reference on the inode is dropped; 4) One or more of the read bios previously submitted are still not complete (their end io callback was not yet invoked or has not yet finished execution); 5) Inode eviction is triggered (through an unlink call for example). The inode reference count was not incremented before submitting the read bios, therefore this is possible; 6) The eviction handler starts executing and enters the loop that iterates over all extent states in the inode's io_tree; 7) The loop picks one extent state record and uses its ->start and ->end fields, after releasing the inode's io_tree spinlock, to call lock_extent_bits() and clear_extent_bit(). The call to lock the range [state->start, state->end] blocks because the whole range or a part of it was locked by the previous call to readpages() and the corresponding end io callback, which unlocks the range was not yet executed; 8) The end io callback for the read bio is executed and unlocks the range [state->start, state->end] (or a superset of that range). And at clear_extent_bit() the extent_state record state is used as a second argument to split_state(), which sets state->start to a larger value; 9) The task executing the eviction handler is woken up by the task executing the bio's end io callback (through clear_state_bit) and the eviction handler locks the range [old value for state->start, state->end]. Shortly after, when calling clear_extent_bit(), it unlocks the range [new value for state->start, state->end], so it ends up unlocking only part of the range that it locked, leaving an extent state record in the io_tree that represents the unlocked subrange; 10) The eviction handler loop, in its next iteration, gets the extent_state record for the subrange that it did not unlock in the previous step and then tries to lock it, resulting in an hang. So fix this by not using the ->start and ->end fields of an existing extent_state record. This is a simple solution, and an alternative could be to bump the inode's reference count before submitting each read bio and having it dropped in the bio's end io callback. But that would be a more invasive/complex change and would not protect against other possible places that are not holding a reference on the inode as well. Something to consider in the future. Many thanks to Zygo Blaxell for reporting, in the mailing list, the issue, a set of scripts to trigger it and testing this fix. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:09 -07:00
Liu Bo	64c043de46	Btrfs: fix up read_tree_block to return proper error The return value of read_tree_block() can confuse callers as it always returns NULL for either -ENOMEM or -EIO, so it's likely that callers parse it to a wrong error, for instance, in btrfs_read_tree_root(). This fixes the above issue. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:08 -07:00
Liu Bo	8635eda91e	Btrfs: add missing free_extent_buffer read_tree_block may take a reference on the 'eb', a following free_extent_buffer is necessary. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:07 -07:00
Liu Bo	0c304304fe	Btrfs: remove csum_bytes_left After commit `8407f55326` ("Btrfs: fix data corruption after fast fsync and writeback error"), during wait_ordered_extents(), we wait for ordered extent setting BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've already got checksum information, so we don't need to check (csum_bytes_left == 0) in the whole logging path. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:06 -07:00
Filipe Manana	39c2d7facc	Btrfs: fix -ENOSPC on block group removal Unlike when attempting to allocate a new block group, where we check that we have enough space in the system space_info to update the device items and insert a new chunk item in the chunk tree, we were not checking if the system space_info had enough space for updating the device items and deleting the chunk item in the chunk tree. This often lead to -ENOSPC error when attempting to allocate blocks for the chunk tree (during btree node/leaf COW operations) while updating the device items or deleting the chunk item, which resulted in the current transaction being aborted and turning the filesystem into read-only mode. While running fstests generic/038, which stresses allocation of block groups and removal of unused block groups, with a large scratch device (750Gb) this happened often, despite more than enough unallocated space, and resulted in the following trace: [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [68663.600407] BTRFS: Transaction aborted (error -28) (...) [68663.730829] Call Trace: [68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs] [68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs] [68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs] [68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs] [68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c [68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs] [68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs] [68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7 [68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90 [68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.849503] ---[ end trace 798477c6d6dbaad6 ]--- [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left So fix this by verifying that enough space exists in system space_info, and reserving the space in the chunk block reserve, before attempting to delete the block group and allocate a new system chunk if we don't have enough space to perform the necessary updates and delete in the chunk tree. Like for the block group creation case, we don't error our if we fail to allocate a new system chunk, since we might end up not needing it (no node/leaf splits happen during the COW operations and/or we end up not needing to COW any btree nodes or leafs because they were already COWed in the current transaction and their writeback didn't start yet). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:05 -07:00
Filipe Manana	4fbcdf6694	Btrfs: fix -ENOSPC when finishing block group creation While creating a block group, we often end up getting ENOSPC while updating the chunk tree, which leads to a transaction abortion that produces a trace like the following: [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [30670.117777] BTRFS: Transaction aborted (error -28) (...) [30670.163567] Call Trace: [30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs] [30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs] [30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs] [30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15 [30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de [30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62 [30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [30670.187792] ---[ end trace 0373e6b491c4a8cc ]--- This is because we don't do proper space reservation for the chunk block reserve when we have multiple tasks allocating chunks in parallel. So block group creation has 2 phases, and the first phase essentially checks if there is enough space in the system space_info, allocating a new system chunk if there isn't, while the second phase updates the device, extent and chunk trees. However, because the updates to the chunk tree happen in the second phase, if we have N tasks, each with its own transaction handle, allocating new chunks in parallel and if there is only enough space in the system space_info to allocate M chunks, where M < N, none of the tasks ends up allocating a new system chunk in the first phase and N - M tasks will get -ENOSPC when attempting to update the chunk tree in phase 2 if they need to COW any nodes/leafs from the chunk tree. Fix this by doing proper reservation in the chunk block reserve. The issue could be reproduced by running fstests generic/038 in a loop, which eventually triggered the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:04 -07:00
Josef Bacik	0d2b2372e0	Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap We should be doing this, it's weird we hadn't been doing this. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:03 -07:00
Omar Sandoval	c8d3fe028f	Btrfs: show subvol= and subvolid= in /proc/mounts Now that we're guaranteed to have a meaningful root dentry, we can just export seq_dentry() and use it in btrfs_show_options(). The subvolume ID is easy to get and can also be useful, so put that in there, too. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:02 -07:00
Omar Sandoval	05dbe6837b	Btrfs: unify subvol= and subvolid= mounting Currently, mounting a subvolume with subvolid= takes a different code path than mounting with subvol=. This isn't really a big deal except for the fact that mounts done with subvolid= or the default subvolume don't have a dentry that's connected to the dentry tree like in the subvol= case. To unify the code paths, when given subvolid= or using the default subvolume ID, translate it into a subvolume name by walking ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:01 -07:00
Omar Sandoval	bb289b7be6	Btrfs: fail on mismatched subvol and subvolid mount options There's nothing to stop a user from passing both subvol= and subvolid= to mount, but if they don't refer to the same subvolume, someone is going to be surprised at some point. Error out on this case, but allow users to pass in both if they do match (which they could, for example, get out of /proc/mounts). Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:00 -07:00
Omar Sandoval	fa33065950	Btrfs: clean up error handling in mount_subvol() In preparation for new functionality in mount_subvol(), give it ownership of subvol_name and tidy up the error paths. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:59 -07:00
Omar Sandoval	e6e4dbe894	Btrfs: remove all subvol options before mounting top-level Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'. But, this means that if the user passes both a subvol and subvolid for some reason, we won't actually mount the top-level when we recursively mount. For example, consider: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt btrfs subvol create /mnt/subvol1 # subvolid=257 btrfs subvol create /mnt/subvol2 # subvolid=258 umount /mnt mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt In the final mount, subvol=/subvol1,subvolid=258 becomes subvolid=0,subvolid=258, and the last option takes precedence, so we mount subvol2 and try to look up subvol1 inside of it, which fails. So, instead, do a thorough scan through the argument list and remove any subvol= and subvolid= options, then append subvolid=0 to the end. This implicitly makes subvol= take precedence over subvolid=, but we're about to add a stricter check for that. This also makes setup_root_args() more generic, which we'll need soon. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:58 -07:00
Omar Sandoval	773cd04ec1	Btrfs: lock superblock before remounting for rw subvol Since commit `0723a0473f` ("btrfs: allow mounting btrfs subvolumes with different ro/rw options"), when mounting a subvolume read/write when another subvolume has previously been mounted read-only, we first do a remount. However, this should be done with the superblock locked, as per sync_filesystem(): /* * We need to be protected against the filesystem going from * r/o to r/w or vice versa. */ WARN_ON(!rwsem_is_locked(&sb->s_umount)); This WARN_ON can easily be hit with: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs subvol create /mnt/vol1 btrfs subvol create /mnt/vol2 umount /mnt mount -oro,subvol=/vol1 /dev/vdb /mnt mount -orw,subvol=/vol2 /dev/vdb /mnt2 Fixes: `0723a0473f` ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:57 -07:00
Filipe Manana	0f31871f44	Btrfs: wake up extent state waiters on unlock through clear_extent_bits When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:56 -07:00
Filipe Manana	c152b63efc	Btrfs: fix chunk allocation regression leading to transaction abort With commit `1b98450816` ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") introduced in the kernel 4.1 merge window, we end up using part of a device hole for which there are already pending chunks or pinned chunks. Before that commit we didn't use the hole and would just move on to the next hole in the device. However when we adjust the start offset for the chunk allocation and we have pinned chunks, we set it blindly to the end offset of the pinned chunk we are currently processing, which is dangerous because we can have a pending chunk that has a start offset that matches the end offset of our pinned chunk - leading us to a case where we end up getting two pending chunks that start at the same physical device offset, which makes us later abort the current transaction with -EEXIST when finishing the chunk allocation at btrfs_create_pending_block_groups(): [194737.659017] ------------[ cut here ]------------ [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [194737.662209] BTRFS: Transaction aborted (error -17) [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [194737.682999] 0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2 [194737.684540] ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78 [194737.686017] ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40 [194737.687509] Call Trace: [194737.688068] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [194737.689027] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [194737.690095] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [194737.691198] [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.693789] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [194737.695065] [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.696806] [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [194737.698683] [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs] [194737.700329] [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs] [194737.701924] [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [194737.703675] [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs] [194737.705417] [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [194737.707058] [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [194737.708560] [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs] [194737.710673] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [194737.712076] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [194737.713293] [<ffffffff81153b58>] vfs_write+0xb2/0x117 [194737.714443] [<ffffffff81154424>] SyS_pwrite64+0x64/0x82 [194737.715646] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]--- [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by btrfs_create_pending_block_groups(), when it attempts to insert a duplicated device extent item via btrfs_alloc_dev_extent(). This issue was reproducible with fstests generic/038 running in a loop for several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard". Applying Jeff's recent patch titled "btrfs: add missing discards when unpinning extents with -o discard" makes the issue much easier to reproduce (usually within 4 to 5 hours), since it pins chunks for longer periods of time when an unused block group is deleted by the cleaner kthread. Fix this by making sure that we never adjust the start offset to a lower value than it currently has. Fixes: `1b98450816` ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:55 -07:00
Sasha Levin	2037a0933b	btrfs: use after free when closing devices __btrfs_close_devices() would call_rcu to free the device, which is racy with list_for_each_entry() accessing the memory to retrieve the next device on the list. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:36 -07:00
David Sterba	01b810b889	btrfs: make root id query unprivileged The INO_LOOKUP ioctl can lookup path for a given inode number and is thus restricted. As a sideefect it can find the root id of the containing subvolume and we're using this int the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args args::treeid = 0 args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID) Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged, after the root restriction is removed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:36 -07:00
Filipe Manana	2e6e518335	Btrfs: fix block group ->space_info null pointer dereference When we create a block group we add it to the rbtree of block groups before setting its ->space_info field (while it's NULL). This is problematic since other tasks can access the block group from the rbtree and attempt to use its ->space_info before it is set by btrfs_make_block_group(). This can happen for example when a concurrent fitrim ioctl operation is ongoing, which produces a trace like the following when CONFIG_DEBUG_PAGEALLOC is set. [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0 [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000 [11509.608179] RIP: 0010:[<ffffffff8107d675>] [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP: 0018:ffff8801b3edf9e8 EFLAGS: 00010002 [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000 [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000 [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000 [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0 [11509.608179] FS: 00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000 [11509.608179] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0 [11509.608179] Stack: [11509.608179] 0000000000000000 0000000000000000 0000000000000004 0000000000000000 [11509.608179] ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000 [11509.608179] 0000000000000001 0000000000000000 ffff880100000000 00000000000006c4 [11509.608179] Call Trace: [11509.608179] [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02 [11509.608179] [<ffffffff8107e806>] lock_acquire+0xa5/0x116 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs] [11509.608179] [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs] [11509.608179] [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs] [11509.608179] [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs] [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e [11509.608179] [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479 [11509.608179] [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e [11509.608179] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [11509.608179] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [11509.608179] [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f [11509.608179] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00 [11509.608179] RIP [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP <ffff8801b3edf9e8> [11509.608179] CR2: 0000000000000018 [11509.608179] ---[ end trace 570a5c6769f0e49a ]--- Which corresponds to the following access in fs/btrfs/free-space-cache.c: static int do_trimming(struct btrfs_block_group_cache block_group, u64 total_trimmed, u64 start, u64 bytes, u64 reserved_start, u64 reserved_bytes, struct btrfs_trim_range trim_entry) { struct btrfs_space_info space_info = block_group->space_info; (...) spin_lock(&space_info->lock); ^^^^^ - block_group->space_info is NULL... Fix this by ensuring the block group's ->space_info is set before adding the block group to the rbtree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:36 -07:00
Anand Jain	33b97e4327	Btrfs: check error before reporting missing device and add uuid Report missing device when add is successful, otherwise it would exit as ENOMEM. And add uuid to the report. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:35 -07:00
Qu Wenruo	1f6e4b3f9f	btrfs: Fix superblock csum type check. Old csum type check is wrong and can't catch csum_type 1(not supported). Fix it to avoid hostile 0 division. Reported-by: Lukas Lueg <lukas.lueg@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:35 -07:00
Filipe Manana	619d8c4ef7	Btrfs: incremental send, fix clone operations for compressed extents Marc reported a problem where the receiving end of an incremental send was performing clone operations that failed with -EINVAL. This happened because, unlike for uncompressed extents, we were not checking if the source clone offset and length, after summing the data offset, falls within the source file's boundaries. So make sure we do such checks when attempting to issue clone operations for compressed extents. Problem reproducible with the following steps: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt2 # Create the file with a single extent of 128K. This creates a metadata file # extent item with a data start offset of 0 and a logical length of 128K. $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo # Now rewrite the range 64K to 112K of our file. This will make the inode's # metadata continue to point to the 128K extent we created before, but now # with an extent item that points to the extent with a data start offset of # 112K and a logical length of 16K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 192K. $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo # Now rewrite the range 180K to 12K. This will make the inode's metadata # continue to point the the 128K extent we created earlier, with a single # extent item that points to it with a start offset of 112K and a logical # length of 4K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 180K. $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ touch /mnt/bar # Calls the btrfs clone ioctl. $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \ -l $((4 * 1024)) /mnt/foo /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 \| btrfs receive /mnt2 At subvol /mnt/snap1 At subvol snap1 $ btrfs send -p /mnt/snap1 /mnt/snap2 \| btrfs receive /mnt2 At subvol /mnt/snap2 At snapshot snap2 ERROR: failed to clone extents to bar Invalid argument A test case for fstests follows soon. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.cz> Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:35 -07:00
Christian Engelmayer	ab3680dd18	btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation() Commit `9c8b35b1ba` ("btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.") introduced the allocation of a temporary ulist in function btrfs_add_qgroup_relation() and added the corresponding cleanup to the out path. However, the allocation was introduced before the src/dst level check that directly returns. Fix the possible leakage of the ulist by moving the allocation after the input validation. Detected by Coverity CID 1295988. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:35 -07:00
Filipe Manana	35c766425a	Btrfs: fix mutex unlock without prior lock on space cache truncation If the call to btrfs_truncate_inode_items() failed and we don't have a block group, we were unlocking the cache_write_mutex without having locked it (we do it only if we have a block group). Fixes: `1bbc621ef2` ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:34 -07:00
Anand Jain	816fcebe8f	Btrfs: log when missing device is created Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:34 -07:00
David Sterba	6d13f5497f	btrfs: fix warnings after changes in btrfs_abort_transaction fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:34 -07:00
David Sterba	c0d19e2b9a	btrfs: add 'cold' compiler annotations to all error handling functions The annotated functios will be placed into .text.unlikely section. The annotation also hints compiler to move the code out of the hot paths, and may implicitly mark if-statement leading to that block as unlikely. This is a heuristic, the impact on the generated code is not significant. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:34 -07:00
David Sterba	1a9a8a71ed	btrfs: report exact callsite where transaction abort occurs WARN is called from a single location and all bugreports say that's in super.c __btrfs_abort_transaction. This is slightly confusing as we'd rather want to know the exact callsite. Whereas this information is printed in the syslog below the stacktrace, this requires further look and we usually see only the headline from WARNING. Moving the WARN into the macro has to inline some code and increases code by a few kilobytes: text data bss dec hex filename 835481 20305 14120 869906 d4612 btrfs.ko.before 842883 20305 14120 877308 d62fc btrfs.ko.after The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by compiler as cold and moved out of the way so the impact is speculated to be low, if measurable at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:34 -07:00
David Sterba	13028901a4	btrfs: let tree defrag work in SSD mode Long time ago (2008) the defrag was automatic for new b-tree writes but has been disabled after performance problems. There was a leftover in tree-defrag.c that effectively stops any defragmentation on b-trees. This is a bit unexpected and IMHO undesired. The SSD mode is an optimization and defrag is supposed to work if the users asks for it. Related commits: `6702ed490c` Btrfs: Add run time btree defrag, and an ioctl to force btree defrag `e18e4809b1` Btrfs: Add mount -o ssd, which includes optimizations for seek free storage `b3236e68bf` Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there `9afbb0b752` Btrfs: Disable tree defrag in SSD mode The last three commits switch the defrag+ssd off/on/off and the last one `3f157a2fd2` Btrfs: Online btree defragmentation fixes misses the bits from tree-defrag.c to revert to the behaviour introduced in `e18e4809b1`. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:33 -07:00
Filipe Manana	53e489bc8c	Btrfs: check pending chunks when shrinking fs to avoid corruption When we shrink the usable size of a device (its total_bytes), we go over all the device extent items in the device tree and attempt to relocate the chunk of any device extent that goes beyond the new usable size for the device. We do that after setting the new usable size (total_bytes) in the device object, so that all new allocations (and reallocations) don't use areas of the device that go beyond the new (shorter) size. However we were not considering that before setting the new size in the device, pending chunks might have been created that use device extents that go beyond the new size, and those device extents are not yet in the device tree after we search the device tree - they are still attached to the list of new block group for some ongoing transaction handle, and they are only added to the device tree when the transaction handle is ended (via btrfs_create_pending_block_groups()). So check for pending chunks with device extents that go beyond the new size and if any exists, commit the current transaction and repeat the search in the device tree. Not doing this it would mean we would return success to user space while still having extents that go beyond the new size, and later user space could override those locations on the device while the fs still references them, causing all sorts of corruption and unexpected events. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:33 -07:00
Omar Sandoval	64ad6c4889	Btrfs: don't invalidate root dentry when subvolume deletion fails Since commit `bafc9b754f` ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: `bafc9b754f` ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-02 19:34:33 -07:00
Filipe Manana	8b191a6849	Btrfs: incremental send, check if orphanized dir inode needs delayed rename If a directory inode is orphanized, because some inode previously processed has a new name that collides with the old name of the current inode, we need to check if it needs its rename operation delayed too, as its ancestor-descendent relationship with some other inode might have been reversed between the parent and send snapshots and therefore its rename operation needs to happen after that other inode is renamed. For example, for the following reproducer where this is needed (provided by Robbie Ko): $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir -p /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir -p /mnt/data/t6/t7 $ mkdir /mnt/data/t5 $ mkdir /mnt/data/t7 $ mkdir /mnt/data/n4/t2 $ mkdir /mnt/data/t4 $ mkdir /mnt/data/t3 $ mv /mnt/data/t7 /mnt/data/n4/t2 $ mv /mnt/data/t4 /mnt/data/n4/t2/t7 $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4 $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5 $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2 $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4 $ mv /mnt/data/n4/t2 /mnt/data/n4/n1 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6 $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3 $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 \| btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 \| btrfs receive /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory Where the parent snapshot directory hierarchy is the following: . (ino 256) \|-- data/ (ino 257) \|-- n4/ (ino 260) \|-- t2/ (ino 265) \|-- t7/ (ino 264) \|-- t4/ (ino 266) \|-- t5/ (ino 263) \|-- t6/ (ino 261) \|-- n1/ (ino 258) \|-- n2/ (ino 259) \|-- t7/ (ino 262) \|-- t3/ (ino 267) And the send snapshot's directory hierarchy is the following: . (ino 256) \|-- data/ (ino 257) \|-- n4/ (ino 260) \|-- n1/ (ino 258) \|-- t2/ (ino 265) \|-- n2/ (ino 259) \|-- t3/ (ino 267) \| \|-- t7 (ino 264) \| \|-- t6/ (ino 261) \| \|-- t4/ (ino 266) \| \|-- t5/ (ino 263) \| \|-- t7/ (ino 262) While processing inode 262 we orphanize inode 264 and later attempt to rename inode 264 to its new name/location, which resulted in building an incorrect destination path string for the rename operation with the value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must have been done only after inode 267 is processed and renamed, as the ancestor-descendent relationship between inodes 264 and 267 was reversed between both snapshots, because otherwise it results in an infinite loop when building the path string for inode 264 when we are processing an inode with a number larger than 264. That loop is the following: start inode 264, send progress of 265 for example parent of 264 -> 267 parent of 267 -> 262 parent of 262 -> 259 parent of 259 -> 261 parent of 261 -> 263 parent of 263 -> 266 parent of 266 -> 264 \|--> back to first iteration while current path string length is <= PATH_MAX, and fail with -ENOMEM otherwise So fix this by making the check if we need to delay a directory rename regardless of the current inode having been orphanized or not. A test case for fstests follows soon. Thanks to Robbie Ko for providing a reproducer for this problem. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-06-03 03:10:40 +01:00
Filipe Manana	80aa602756	Btrfs: incremental send, don't delay directory renames unnecessarily Even though we delay the rename of directories when they become descendents of other directories that were also renamed in the send root to prevent infinite path build loops, we were doing it in cases where this was not needed and was actually harmful resulting in infinite path build loops as we ended up with a circular dependency of delayed directory renames. Consider the following reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir /mnt/data $ mkdir /mnt/data/n1 $ mkdir /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir /mnt/data/n1/n2/p1 $ mkdir /mnt/data/n1/n2/p1/p2 $ mkdir /mnt/data/t6 $ mkdir /mnt/data/t7 $ mkdir -p /mnt/data/t5/t7 $ mkdir /mnt/data/t2 $ mkdir /mnt/data/t4 $ mkdir -p /mnt/data/t1/t3 $ mkdir /mnt/data/p1 $ mv /mnt/data/t1 /mnt/data/p1 $ mkdir -p /mnt/data/p1/p2 $ mv /mnt/data/t4 /mnt/data/p1/p2/t1 $ mv /mnt/data/t5 /mnt/data/n4/t5 $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2 $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7 $ mv /mnt/data/t2 /mnt/data/n4/t1 $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1 $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2 $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1 $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7 $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1 $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5 $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1 $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/ $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2 $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1 $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1 $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2 $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7 $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1 $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5 $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 \| btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 \| btrfs receive -vv /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory This reproducer resulted in an infinite path build loop when building the path for inode 266 because the following circular dependency of delayed directory renames was created: ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261 Where the notation "X <- Y" means the rename of inode X is delayed by the rename of inode Y (X will be renamed after Y is renamed). This resulted in an infinite path build loop of inode 266 because that inode has inode 261 as an ancestor in the send root and inode 261 is in the circular dependency of delayed renames listed above. Fix this by not delaying the rename of a directory inode if an ancestor of the inode in the send root, which has a delayed rename operation, is not also a descendent of the inode in the parent root. Thanks to Robbie Ko for sending the reproducer example. A test case for xfstests follows soon. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-06-03 03:10:20 +01:00
Jaegeuk Kim	f56aa1c57e	f2fs: fix to return exact trimmed size Now, we add all the candidates for trim commands and then finally issue discard commands. So, we should count the trimmed size in back-end. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-02 15:48:20 -07:00
Sasha Levin	161f873b89	vfs: read file_handle only once in handle_to_path We used to read file_handle twice. Once to get the amount of extra bytes, and once to fetch the entire structure. This may be problematic since we do size verifications only after the first read, so if the number of extra bytes changes in userspace between the first and second calls, we'll have an incoherent view of file_handle. Instead, read the constant size once, and copy that over to the final structure without having to re-read it again. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-06-02 10:29:07 -07:00
Chao Yu	f62185d0e2	f2fs: support FALLOC_FL_INSERT_RANGE FALLOC_FL_INSERT_RANGE flag for ->fallocate was introduced in commit `dd46c78778` ("fs: Add support FALLOC_FL_INSERT_RANGE for fallocate"). The effect of FALLOC_FL_INSERT_RANGE command is the opposite of FALLOC_FL_COLLAPSE_RANGE, if this command was performed, all data from offset to EOF in our file will be shifted to right as given length, and then range [offset, offset + length] becomes a hole. This command is useful for our user who wants to add some data in the middle of the file, for example: video/music editor will insert a keyframe in specified position of media file, with this command we can easily create a hole for inserting without removing original data. This patch introduces f2fs_insert_range() to support FALLOC_FL_INSERT_RANGE. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-02 09:53:27 -07:00
Chao Yu	528e34593d	f2fs: hide common code in f2fs_replace_block This patch clean up codes through: 1.rename f2fs_replace_block to __f2fs_replace_block(). 2.introduce new f2fs_replace_block() to include __f2fs_replace_block() and some common related codes around __f2fs_replace_block(). Then, newly introduced function f2fs_replace_block can be used by following patch. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-02 09:52:07 -07:00
Abhi Das	9cde2898d0	gfs2: limit quota log messages This patch makes the quota subsystem only report once that a particular user/group has exceeded their allotted quota. Previously, it was possible for a program to continuously try exceeding quota (despite receiving EDQUOT) and in turn trigger gfs2 to issue a kernel log message about quota exceed. In theory, this could get out of hand and flood the log and the filesystem hosting the log files. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-06-02 11:03:04 -05:00
Abhi Das	39a725803b	gfs2: fix quota updates on block boundaries For smaller block sizes (512B, 1K, 2K), some quotas straddle block boundaries such that the usage value is on one block and the rest of the quota is on the previous block. In such cases, the value does not get updated correctly. This patch fixes that by addressing the boundary conditions correctly. This patch also adds a (s64) cast that was missing in a call to gfs2_quota_change() in inode.c Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-06-02 11:02:24 -05:00
Jens Axboe	d2e73fcceb	buffer: remove unusued 'ret' variable Merge hickup on my part, due to a clash between the writeback changes and the EOPNOTSUPP removal in _submit_bh(). Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 09:22:34 -06:00
Tejun Heo	e8a7abf5a5	writeback: disassociate inodes from dying bdi_writebacks For the purpose of foreign inode detection, wb's (bdi_writeback's) are identified by the associated memcg ID. As we create a separate wb for each memcg, this is enough to identify the active wb's; however, when blkcg is enabled or disabled higher up in the hierarchy, the mapping between memcg and blkcg changes which in turn creates a new wb to service the new mapping. The old wb is unlinked from index and released after all references are drained. The foreign inode detection logic can't detect this condition because both the old and new wb's point to the same memcg and thus never decides to move inodes attached to the old wb to the new one. This patch adds logic to initiate switching immediately in wbc_attach_and_unlock_inode() if the associated wb is dying. We can make the usual foreign detection logic to distinguish the different wb's mapped to the memcg but the dying wb is never gonna be in active service again and there's no point in tracking the usage history and reaching the switch verdict after enough data points are collected. It's already known that the wb has to be switched. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	d10c809552	writeback: implement foreign cgroup inode bdi_writeback switching As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching. With the previously implemented [unlocked_]inode_to_wb_and_list_lock() and wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wb_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	aaa2cacf81	writeback: add lockdep annotation to inode_to_wb() With the previous three patches, all operations which acquire wb from inode are either under one of inode->i_lock, mapping->tree_lock or wb->list_lock or protected by unlocked_inode_to_wb transaction. This will be depended upon by foreign inode wb switching. This patch adds lockdep assertion to inode_to_wb() so that usages outside the above list locks can be caught easily. There are three exceptions. * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the wb may not be the inode's. Ensuring that is the function's role after all. Updated to deref inode->i_wb directly. * inode_wb_stat_unlocked_begin() is usually protected by combination of !I_WB_SWITCH and rcu_read_lock(). Updated to deref inode->i_wb directly. * inode_congested() wants to test whether inode->i_wb is set before starting the transaction. Added inode_to_wb_is_valid() which tests inode->i_wb directly. v5: might_lock() removed. It annotates that the lock is grabbed w/ irq enabled which isn't the case and triggering lockdep warning spuriously. v4: might_lock() added to unlocked_inode_to_wb_begin(). v3: inode_congested() conversion added. v2: locked_inode_to_wb_and_lock_list() was missing in the first version. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	5cb8b8241e	writeback: use unlocked_inode_to_wb transaction in inode_congested() Similar to wb stat updates, inode_congested() accesses the associated wb of an inode locklessly, which will break with foreign inode wb switching. This path updates inode_congested() to use unlocked inode wb access transaction introduced by the previous patch. Combined with the previous two patches, this makes all wb list and access operations to be protected by either of inode->i_lock, wb->list_lock, or mapping->tree_lock while wb switching is in progress. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	682aa8e1a6	writeback: implement unlocked_inode_to_wb transaction and use it for stat updates The mechanism for detecting whether an inode should switch its wb (bdi_writeback) association is now in place. This patch build the framework for the actual switching. This patch adds a new inode flag I_WB_SWITCHING, which has two functions. First, the easy one, it ensures that there's only one switching in progress for a give inode. Second, it's used as a mechanism to synchronize wb stat updates. The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters but track the current number of dirty pages and pages under writeback respectively. As such, when an inode is moved from one wb to another, the inode's portion of those stats have to be transferred together; unfortunately, this is a bit tricky as those stat updates are percpu operations which are performed without holding any lock in some places. This patch solves the problem in a similar way as memcg. Each such lockless stat updates are wrapped in transaction surrounded by unlocked_inode_to_wb_begin/end(). During normal operation, they map to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted, mapping->tree_lock is grabbed across the transaction. In turn, the switching path sets I_WB_SWITCHING and waits for a RCU grace period to pass before actually starting to switch, which guarantees that all stat update paths are synchronizing against mapping->tree_lock. This patch still doesn't implement the actual switching. v3: Updated on top of the recent cancel_dirty_page() updates. unlocked_inode_to_wb_begin() now nests inside mem_cgroup_begin_page_stat() to match the locking order. v2: The i_wb access transaction will be used for !stat accesses too. Function names and comments updated accordingly. s/inode_wb_stat_unlocked_{begin\|end}/unlocked_inode_to_wb_{begin\|end}/ s/switch_wb/switch_wbs/ Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	87e1d789bf	writeback: implement [locked_]inode_to_wb_and_lock_list() cgroup writeback currently assumes that inode to wb association doesn't change; however, with the planned foreign inode wb switching mechanism, the association will change dynamically. When an inode needs to be put on one of the IO lists of its wb, the current code simply calls inode_to_wb() and locks the returned wb; however, with the planned wb switching, the association may change before locking the wb and may even get released. This patch implements [locked_]inode_to_wb_and_lock_list() which pins the associated wb while holding i_lock, releases it, acquires wb->list_lock and verifies that the association hasn't changed inbetween. As the association will be protected by both locks among other things, this guarantees that the wb is the inode's associated wb until the list_lock is released. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	2a81490811	writeback: implement foreign cgroup inode detection As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and will transfer the ownership to it. To avoid unnnecessary oscillation, the detection mechanism keeps track of history and gives out the switch verdict only if the foreign usage pattern is stable over a certain amount of time and/or writeback attempts. The detection mechanism has fairly low space and computation overhead. It adds 8 bytes to struct inode (one int and two u16's) and minimal amount of calculation per IO. The detection mechanism converges to the correct answer usually in several seconds of IO time when there's a clear majority dirtier. Even when there isn't, it can reach an acceptable answer fairly quickly under most circumstances. Please see wb_detach_inode() for more details. This patch only implements detection. Following patches will implement actual switching. v2: wbc_account_io() now checks whether the wbc is associated with a wb before dereferencing it. This can happen when pageout() is writing pages directly without going through the usual writeback path. As pageout() path is single-threaded, we don't want it to be blocked behind a slow cgroup and ultimately want it to delegate actual writing to the usual writeback path. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:40:20 -06:00
Tejun Heo	b16b1deb55	writeback: make writeback_control track the inode being written back Currently, for cgroup writeback, the IO submission paths directly associate the bio's with the blkcg from inode_to_wb_blkcg_css(); however, it'd be necessary to keep more writeback context to implement foreign inode writeback detection. wbc (writeback_control) is the natural fit for the extra context - it persists throughout the writeback of each inode and is passed all the way down to IO submission paths. This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and wbc_attach_fdatawrite_inode() which are used to associate wbc with the inode being written back. IO submission paths now use wbc_init_bio() instead of directly associating bio's with blkcg themselves. This leaves inode_to_wb_blkcg_css() w/o any user. The function is removed. wbc currently only tracks the associated wb (bdi_writeback). Future patches will add more for foreign inode detection. The association is established under i_lock which will be depended upon when migrating foreign inodes to other wb's. As currently, once established, inode to wb association never changes, going through wbc when initializing bio's doesn't cause any behavior changes. v2: submit_blk_blkcg() now checks whether the wbc is associated with a wb before dereferencing it. This can happen when pageout() is writing pages directly without going through the usual writeback path. As pageout() path is single-threaded, we don't want it to be blocked behind a slow cgroup and ultimately want it to delegate actual writing to the usual writeback path. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:39:48 -06:00
Tejun Heo	21c6321fbb	writeback: relocate wb[_try]_get(), wb_put(), inode_{attach\|detach}_wb() Currently, majority of cgroup writeback support including all the above functions are implemented in include/linux/backing-dev.h and mm/backing-dev.c; however, the portion closely related to writeback logic implemented in include/linux/writeback.h and mm/page-writeback.c will expand to support foreign writeback detection and correction. This patch moves wb[_try]_get() and wb_put() to include/linux/backing-dev-defs.h so that they can be used from writeback.h and inode_{attach\|detach}_wb() to writeback.h and page-writeback.c. This is pure reorganization and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:39:08 -06:00
Tejun Heo	aa661bbe1e	writeback: move over_bground_thresh() to mm/page-writeback.c and rename it to wb_over_bg_thresh(). The function is closely tied to the dirty throttling mechanism implemented in page-writeback.c. This relocation will allow future updates necessary for cgroup writeback support. While at it, add function comment. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:13 -06:00
Tejun Heo	dcc25ae76e	writeback: move global_dirty_limit into wb_domain This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback. global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:12 -06:00
Tejun Heo	8a73179956	writeback: reorganize [__]wb_update_bandwidth() __wb_update_bandwidth() is called from two places - fs/fs-writeback.c::balance_dirty_pages() and mm/page-writeback.c::wb_writeback(). The latter updates only the write bandwidth while the former also deals with the dirty ratelimit. The two callsites are distinguished by whether @thresh parameter is zero or not, which is cryptic. In addition, the two files define their own different versions of wb_update_bandwidth() on top of __wb_update_bandwidth(), which is confusing to say the least. This patch cleans up [__]wb_update_bandwidth() in the following ways. * __wb_update_bandwidth() now takes explicit @update_ratelimit parameter to gate dirty ratelimit handling. * mm/page-writeback.c::wb_update_bandwidth() is flattened into its caller - balance_dirty_pages(). * fs/fs-writeback.c::wb_update_bandwidth() is moved to mm/page-writeback.c and __wb_update_bandwidth() is made static. * While at it, add a lockdep assertion to __wb_update_bandwidth(). Except for the lockdep addition, this is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:12 -06:00
Tejun Heo	0d960a383a	writeback: clean up wb_dirty_limit() The function name wb_dirty_limit(), its argument @dirty and the local variable @wb_dirty are mortally confusing given that the function calculates per-wb threshold value not dirty pages, especially given that @dirty and @wb_dirty are used elsewhere for dirty pages. Let's rename the function to wb_calc_thresh() and wb_dirty to wb_thresh. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:12 -06:00
Tejun Heo	108dad65be	ext2: enable cgroup writeback support Writeback now supports cgroup writeback and the generic writeback, buffer, libfs, and mpage helpers that ext2 uses are all updated to work with cgroup writeback. This patch enables cgroup writeback for ext2 by adding FS_CGROUP_WRITEBACK to its ->fs_flags. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:04 -06:00
Tejun Heo	429b3fb027	mpage: make __mpage_writepage() honor cgroup writeback __mpage_writepage() is used to implement mpage_writepages() which in turn is used for ->writepages() of various filesystems. All writeback logic is now updated to handle cgroup writeback and the block cgroup to issue IOs for is encoded in writeback_control and can be retrieved from the inode; however, __mpage_writepage() currently ignores the blkcg indicated by the inode and issues all bio's without explicit blkcg association. This patch updates __mpage_writepage() so that the issued bio's are associated with inode_to_writeback_blkcg_css(inode). v2: Updated for per-inode wb association. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:38:04 -06:00
Tejun Heo	bafc0dba1e	buffer, writeback: make __block_write_full_page() honor cgroup writeback [__]block_write_full_page() is used to implement ->writepage in various filesystems. All writeback logic is now updated to handle cgroup writeback and the block cgroup to issue IOs for is encoded in writeback_control and can be retrieved from the inode; however, [__]block_write_full_page() currently ignores the blkcg indicated by inode and issues all bio's without explicit blkcg association. This patch adds submit_bh_blkcg() which associates the bio with the specified blkio cgroup before issuing and uses it in __block_write_full_page() so that the issued bio's are associated with inode_to_wb_blkcg_css(inode). v2: Updated for per-inode wb association. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:37:23 -06:00
Tejun Heo	0747259d13	writeback: dirty inodes against their matching cgroup bdi_writeback's __mark_inode_dirty() always dirtied the inode against the root wb (bdi_writeback). The previous patches added all the infrastructure necessary to attribute an inode against the wb of the dirtying cgroup. This patch updates __mark_inode_dirty() so that it uses the wb associated with the inode instead of unconditionally using the root one. Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being dirtied against the root wb. v2: Updated for per-inode wb association. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:37 -06:00
Tejun Heo	db12536040	writeback: make writeback initiation functions handle multiple bdi_writeback's [try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only handle dirty inodes on the root wb (bdi_writeback) of the target bdi. This patch implements bdi_split_work_to_wbs() and use it to make these functions handle multiple wb's. bdi_split_work_to_wbs() takes a base wb_writeback_work and create clones of it and issue them to the wb's of the target bdi. The base work's nr_pages is distributed using wb_split_bdi_pages() - ie. according to each wb's write bandwidth's proportion in the bdi. Cloning a bdi involves memory allocation which may fail. In such cases, bdi_split_work_to_wbs() issues the base work directly and waits for its completion before proceeding to the next wb to guarantee forward progress and correctness under memory pressure. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:37 -06:00
Tejun Heo	f30a7d0cc8	writeback: restructure try_writeback_inodes_sb[_nr]() try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it handles s_umount locking and skips if writeback is already in progress. The in progress test is performed on the root wb (bdi_writeback) which isn't sufficient for cgroup writeback support. The test must be done per-wb. To prepare for the change, this patch factors out __writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds @skip_if_busy and moves the in progress test right before queueing the wb_writeback_work. try_writeback_inodes_sb_nr() now just grabs s_umount and invokes __writeback_inodes_sb_nr() with asserted @skip_if_busy. This way, later addition of multiple wb handling can skip only the wb's which already have writeback in progress. This swaps the order between in progress test and s_umount test which can flip the return value when writeback is in progress and s_umount is being held by someone else but this shouldn't cause any meaningful difference. It's a fringe condition and the return value is an unsynchronized hint anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	98754bf770	writeback: implement wb_wait_for_single_work() For cgroup writeback, multiple wb_writeback_work items may need to be issuedto accomplish a single task. The previous patch updated the waiting mechanism such that wb_wait_for_completion() can wait for multiple work items. Issuing mulitple work items involves memory allocation which may fail. As most writeback operations can't fail or blocked on memory allocation, in such cases, we'll fall back to sequential issuing of an on-stack work item, which would need to be waited upon sequentially. This patch implements wb_wait_for_single_work() which waits for a single work item independently from wb_completion waiting so that such fallback mechanism can be used without getting tangled with the usual issuing / completion operation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	cc395d7f1f	writeback: implement bdi_wait_for_completion() If the completion of a wb_writeback_work can be waited upon by setting its ->done to a struct completion and waiting on it; however, for cgroup writeback support, it's necessary to issue multiple work items to multiple bdi_writebacks and wait for the completion of all. This patch implements wb_completion which can wait for multiple work items and replaces the struct completion with it. It can be defined using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and waited for by wb_wait_for_completion(). Nobody currently issues multiple work items and this patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	ac7b19a34f	writeback: add wb_writeback_work->auto_free Currently, a wb_writeback_work is freed automatically on completion if it doesn't have ->done set. Add wb_writeback_work->auto_free to make the switch explicit. This will help cgroup writeback support where waiting for completion and whether to free automatically don't necessarily move together. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	001fe6f617	writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's wakeup_dirtytime_writeback() currently only starts writeback on the root wb (bdi_writeback). For cgroup writeback support, update the function to check all wbs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	f2b6512160	writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's wakeup_flusher_threads() currently only starts writeback on the root wb (bdi_writeback). For cgroup writeback support, update the function to wake up all wbs and distribute the number of pages to write according to the proportion of each wb's write bandwidth, which is implemented in wb_split_bdi_pages(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	9ecf4866c0	writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info bdi_start_background_writeback() currently takes @bdi and kicks the root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	bc05873dcc	writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info writeback_in_progress() currently takes @bdi and returns whether writeback is in progress on its root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. While at it, make it an inline function. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	c00ddad39f	writeback: remove bdi_start_writeback() bdi_start_writeback() is a thin wrapper on top of __wb_start_writeback() which is used only by laptop_mode_timer_fn(). This patches removes bdi_start_writeback(), renames __wb_start_writeback() to wb_start_writeback() and makes laptop_mode_timer_fn() use it instead. This doesn't cause any functional difference and will ease making laptop_mode_timer_fn() cgroup writeback aware. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	e79729123f	writeback: don't issue wb_writeback_work if clean There are several places in fs/fs-writeback.c which queues wb_writeback_work without checking whether the target wb (bdi_writeback) has dirty inodes or not. The only thing wb_writeback_work does is writing back the dirty inodes for the target wb and queueing a work item for a clean wb is essentially noop. There are some side effects such as bandwidth stats being updated and triggering tracepoints but these don't affect the operation in any meaningful way. This patch makes all writeback_inodes_sb_nr() and sync_inodes_sb() skip wb_queue_work() if the target bdi is clean. Also, it moves dirtiness check from wakeup_flusher_threads() to __wb_start_writeback() so that all its callers benefit from the check. While the overhead incurred by scheduling a noop work isn't currently significant, the overhead may be higher with cgroup writeback support as we may end up issuing noop work items to a lot of clean wb's. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	95a46c65e3	writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account bdi_has_dirty_io() used to only reflect whether the root wb (bdi_writeback) has dirty inodes. For cgroup writeback support, it needs to take all active wb's into account. If any wb on the bdi has dirty inodes, bdi_has_dirty_io() should return true. To achieve that, as inode_wb_list_{move\|del}_locked() now keep track of the dirty state transition of each wb, the number of dirty wbs can be counted in the bdi; however, bdi is already aggregating wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when there are any dirty inodes by ensuring wb->avg_write_bandwidth can't dip below 1. bdi_has_dirty_io() can simply test whether bdi->tot_write_bandwidth is zero or not. While this bumps the value of wb->avg_write_bandwidth to one when it used to be zero, this shouldn't cause any meaningful behavior difference. bdi_has_dirty_io() is made an inline function which tests whether ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its value are added to inode_wb_list_{move\|del}_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:36 -06:00
Tejun Heo	766a9d6e60	writeback: implement backing_dev_info->tot_write_bandwidth cgroup writeback support needs to keep track of the sum of avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to distribute write workload. This patch adds bdi->tot_write_bandwidth and updates inode_wb_list_move_locked(), inode_wb_list_del_locked() and wb_update_write_bandwidth() to adjust it as wb's gain and lose dirty inodes and its avg_write_bandwidth gets updated. As the update events are not synchronized with each other, bdi->tot_write_bandwidth is an atomic_long_t. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:35 -06:00
Tejun Heo	d6c10f1fc8	writeback: implement WB_has_dirty_io wb_state flag Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback) has any dirty inode by testing all three IO lists on each invocation without actively keeping track. For cgroup writeback support, a single bdi will host multiple wb's each of which will host dirty inodes separately and we'll need to make bdi_has_dirty_io(), which currently only represents the root wb, aggregate has_dirty_io from all member wb's, which requires tracking transitions in has_dirty_io state on each wb. This patch introduces inode_wb_list_{move\|del}_locked() to consolidate IO list operations leaving queue_io() the only other function which directly manipulates IO lists (via move_expired_inodes()). All three functions are updated to call wb_io_lists_[de]populated() which keep track of whether the wb has dirty inodes or not and record it using the new WB_has_dirty_io flag. inode_wb_list_moved_locked()'s return value indicates whether the wb had no dirty inodes before. mark_inode_dirty() is restructured so that the return value of inode_wb_list_move_locked() can be used for deciding whether to wake up the wb. While at it, change {bdi\|wb}_has_dirty_io()'s return values to bool. These functions were returning 0 and 1 before. Also, add a comment explaining the synchronization of wb_state flags. v2: Updated to accommodate b_dirty_time. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:35 -06:00
Tejun Heo	703c270887	writeback: implement and use inode_congested() In several places, bdi_congested() and its wrappers are used to determine whether more IOs should be issued. With cgroup writeback support, this question can't be answered solely based on the bdi (backing_dev_info). It's dependent on whether the filesystem and bdi support cgroup writeback and the blkcg the inode is associated with. This patch implements inode_congested() and its wrappers which take @inode and determines the congestion state considering cgroup writeback. The new functions replace bdi_*congested() calls in places where the query is about specific inode and task. There are several filesystem users which also fit this criteria but they should be updated when each filesystem implements cgroup writeback support. v2: Now that a given inode is associated with only one wb, congestion state can be determined independent from the asking task. Drop @task. Spotted by Vivek. Also, converted to take @inode instead of @mapping and renamed to inode_congested(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:35 -06:00
Tejun Heo	52ebea749a	writeback: make backing_dev_info host cgroup-specific bdi_writebacks For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:35 -06:00
Tejun Heo	a212b105b0	bdi: make inode_to_bdi() inline Now that bdi definitions are moved to backing-dev-defs.h, backing-dev.h can include blkdev.h and inline inode_to_bdi() without worrying about introducing circular include dependency. The function gets called from hot paths and fairly trivial. This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the function calls inline. blockdev_superblock and noop_backing_dev_info are EXPORT_GPL'd to allow the inline functions to be used from modules. While at it, make sb_is_blkdev_sb() return bool instead of int. v2: Fixed typo in description as suggested by Jan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	66114cad64	writeback: separate out include/linux/backing-dev-defs.h With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	f0054bb1e1	writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->wb_lock and ->worklist into wb. * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While moving, rename it to wb->work_lock as wb->wb_lock is confusing. Also, move wb->dwork downwards so that it's colocated with the new ->work_lock and ->work_list fields. * bdi_writeback_workfn() -> wb_workfn() bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb) bdi_wakeup_thread(bdi) -> wb_wakeup(wb) bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...) __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...) get_next_work_item(bdi) -> get_next_work_item(wb) * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb. The function contained parts which belong to the containing bdi rather than the wb itself - testing cap_writeback_dirty and bdi_remove_from_list() invocation. Those are moved to bdi_unregister(). * bdi_wb_{init\|exit}() are renamed to wb_{init\|exit}(). Initializations of the moved bdi->wb_lock and ->work_list are relocated from bdi_init() to wb_init(). * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	a88a341a73	writeback: move bandwidth related fields from backing_dev_info into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bandwidth related fields from backing_dev_info into bdi_writeback. * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp, write_bandwidth, avg_write_bandwidth, dirty_ratelimit, balanced_dirty_ratelimit, completions and dirty_exceeded. * writeback_chunk_size() and over_bground_thresh() now take @wb instead of @bdi. * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...) bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...) bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...) bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...) [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...) bdi_{max\|min}_pause(bdi, ...) -> wb_{max\|min}_pause(wb, ...) bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...) * Init/exits of the relocated fields are moved to bdi_wb_init/exit() respectively. Note that explicit zeroing is dropped in the process as wb's are cleared in entirety anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. v2: Typo in description fixed as suggested by Jan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	93f78d8828	writeback: move backing_dev_info->bdi_stat[] into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->bdi_stat[] into wb. * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all enums is changed from BDI_ to WB_. * BDI_STAT_BATCH() -> WB_STAT_BATCH() * [__]{add\|inc\|dec\|sum}_wb_stat(bdi, ...) -> [__]{add\|inc}_wb_stat(wb, ...) * bdi_stat[_error]() -> wb_stat[_error]() * bdi_writeout_inc() -> wb_writeout_inc() * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and frees stat. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	4452226ea2	writeback: move backing_dev_info->state into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Greg Thelen	c4843a7593	memcg: add per cgroup dirty page accounting When modifying PG_Dirty on cached file pages, update the new MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where global NR_FILE_DIRTY is managed. The new memcg stat is visible in the per memcg memory.stat cgroupfs file. The most recent past attempt at this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632 The new accounting supports future efforts to add per cgroup dirty page throttling and writeback. It also helps an administrator break down a container's memory usage and provides evidence to understand memcg oom kills (the new dirty count is included in memcg oom kill messages). The ability to move page accounting between memcg (memory.move_charge_at_immigrate) makes this accounting more complicated than the global counter. The existing mem_cgroup_{begin,end}_page_stat() lock is used to serialize move accounting with stat updates. Typical update operation: memcg = mem_cgroup_begin_page_stat(page) if (TestSetPageDirty()) { [...] mem_cgroup_update_page_stat(memcg) } mem_cgroup_end_page_stat(memcg) Summary of mem_cgroup_end_page_stat() overhead: - Without CONFIG_MEMCG it's a no-op - With CONFIG_MEMCG and no inter memcg task movement, it's just rcu_read_lock() - With CONFIG_MEMCG and inter memcg task movement, it's rcu_read_lock() + spin_lock_irqsave() A memcg parameter is added to several routines because their callers now grab mem_cgroup_begin_page_stat() which returns the memcg later needed by for mem_cgroup_update_page_stat(). Because mem_cgroup_begin_page_stat() may disable interrupts, some adjustments are needed: - move __mark_inode_dirty() from __set_page_dirty() to its caller. __mark_inode_dirty() locking does not want interrupts disabled. - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in __delete_from_page_cache(), replace_page_cache_page(), invalidate_complete_page2(), and __remove_mapping(). text data bss dec hex filename 8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before 8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after +192 text bytes 8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before `8966750` 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after +773 text bytes Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for all metrics, they're all wall clock or cycle counts. The read and write fault benchmarks just measure fault time, they do not include I/O time. * CONFIG_MEMCG not set: baseline patched kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples) dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03% dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99% dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77% read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples) write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples) * CONFIG_MEMCG=y root_memcg: baseline patched kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples) dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90% dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33% dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00% read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples) write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples) * CONFIG_MEMCG=y non-root_memcg: baseline patched kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples) dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82% dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27% dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52% read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples) write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples) As expected anon page faults are not affected by this patch. tj: Updated to apply on top of the recent cancel_dirty_page() changes. Signed-off-by: Sha Zhengju <handai.szj@gmail.com> Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:33 -06:00
Tejun Heo	11f81becca	page_writeback: revive cancel_dirty_page() in a restricted form cancel_dirty_page() had some issues and `b9ea25152e` ("page_writeback: clean up mess around cancel_dirty_page()") replaced it with account_page_cleaned() which makes the caller responsible for clearing the dirty bit; unfortunately, the planned changes for cgroup writeback support requires synchronization between dirty bit manipulation and stat updates. While we can open-code such synchronization in each account_page_cleaned() callsite, that's gonna be unnecessarily awkward and verbose. This patch revives cancel_dirty_page() but in a more restricted form. All it does is TestClearPageDirty() followed by account_page_cleaned() invocation if the page was dirty. This helper covers all account_page_cleaned() usages except for __delete_from_page_cache() which is a special case anyway and left alone. As this leaves no module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped from it. This patch just revives cancel_dirty_page() as a trivial wrapper to replace equivalent usages and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:33 -06:00
Julia Lawall	13985b1f77	NFS: drop unneeded goto Delete jump to a label on the next line, when that label is not used elsewhere. A simplified version of the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r@ identifier l; @@ -if (...) goto l; -l: // </smpl> Also drop the unnecessary ret variable. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
Chuck Lever	d683cc49da	NFS: Fix size of NFSACL SETACL operations When encoding the NFSACL SETACL operation, reserve just the estimated size of the ACL rather than a fixed maximum. This eliminates needless zero padding on the wire that the server ignores. Fixes: `ee5dc7732b` ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:28 -04:00
NeilBrown	7ef5ca4fe4	NFS: report more appropriate block size for directories. In glibc 2.21 (and several previous), a call to opendir() will result in a 32K (BUFSIZ4) buffer being allocated and passed to getdents. However a call to fdopendir() results in an 'fstat' request to determine block size and a matching buffer allocated for subsequent use with getdents. This will typically be 1M. The first getdents call on an NFS directory will always use READDIR_PLUS (or NFSv4 equivalent) if available. Subsequent getdents calls only use this more expensive version if some 'stat' requests are made between the getdents calls. For this reason it is good to keep at least that first getdents call relatively short. When fdopendir() and readdir() is used on a large directory, it takes approximately 32 times as long to complete as using "opendir". Current versions of 'find' use fdopendir() and demonstrate this slowness. 'stat' on a directory currently returns the 'wsize'. This number has no meaning on directories. Actual READDIR requests are limited to ->dtsize, which itself is capped at 4 pages, coincidently the same as BUFSIZ4. So this is a meaningful number to use as the blocksize on directories, and has the effect of making 'find' on large directories go a lot faster. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:27 -04:00
Trond Myklebust	5cae02f427	NFSv4: Always drain the slot table before re-establishing the lease While the NFSv4.1 code has always drained the slot tables in order to stop non-recovery related RPC calls when doing lease recovery, the NFSv4 code did not. The reason for the difference in behaviour is that NFSv4 does not have session state, and so RPC calls can in theory proceed while recovery is happening. In practice, however, anything I/O or state related needs to wait until recovery is over. This patch changes the behaviour of NFSv4 to match that of NFSv4.1 so that we can simplify the state recovery code by assuming that we do not have to deal with races between recovery and ordinary I/O. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-02 08:55:27 -04:00
Chao Yu	a1fe33af5f	ubifs: fix to check error code of register_shrinker register_shrinker() in ubifs_init() can fail due to fail to call kzalloc. This patch fixes to check the return value of register_shrinker, otherwise our shrinker may be unregistered after ubifs initialized successfully. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2015-06-02 11:41:38 +02:00
David S. Miller	dda922c831	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 22:51:30 -07:00
Chenxi Mao	96c6dd59bf	f2fs: disable the discard option when device doesn't support Current f2fs check the whether the blk device can support discard. However, the code will cause the discard option cannot be enabled. Because the clear_opt(sbi, DISCARD) will be invoked forever. This patch can fix this issue. Jaegeuk Kim: The original patch was intended to disable the discard option when device does not support trim command. Rather than remaining the buggy patch, let's replace with this patch as an integrated one. Signed-off-by: Chenxi Mao <chenxi.mao2013@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:32 -07:00
Jaegeuk Kim	4683ff837c	f2fs crypto: remove alloc_page for bounce_page We don't need to call alloc_page() prior to mempool_alloc(), since the mempool_alloc() calls alloc_page() internally. And, if __GFP_WAIT is set, it never fails on page allocation, so let's give GFP_NOWAIT and handle ENOMEM by writepage(). Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:10 -07:00
Jaegeuk Kim	9236cac566	f2fs: fix a deadlock for summary page lock vs. sentry_lock In f2fs_gc: In f2fs_replace_block: - lock_page(sum_page) - check_valid_map() - mutex_lock(sentry_lock) - mutex_lock(sentry_lock) - change_curseg() - lock_page(sum_page) This patch fixes the deadlock condition. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:09 -07:00
Jaegeuk Kim	e5e0906b6b	f2fs crypto: clean up error handling in f2fs_fname_setup_filename Sync with: ext4 crypto: clean up error handling in ext4_fname_setup_filename Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:08 -07:00
Jaegeuk Kim	e992e238ff	f2fs crypto: avoid f2fs_inherit_context for symlink This patch fixes to call f2fs_inherit_context twice for newly created symlink. The original one is called by f2fs_add_link(), which invokes f2fs_setxattr. If the second one is called again, f2fs_setxattr is triggered again with same encryption index. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:07 -07:00
Chao Yu	4637fd11ff	f2fs crypto: do not set encryption policy for non-directory by ioctl Encryption policy should only be set to an empty directory through ioctl, This patch add a judgement condition to verify type of the target inode to avoid incorrectly configuring for non-directory. Additionally, remove unneeded inline data conversion since regular or symlink file should not be processed here. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:07 -07:00
Chao Yu	81b0a8ffaa	f2fs crypto: allow setting encryption policy once This patch add XATTR_CREATE flag in setxattr when setting encryption context for inode. Without this flag the context could be set more than once, this should never happen. So, fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:06 -07:00
Chao Yu	d3baf7c472	f2fs crypto: check context consistent for rename2 For exchange rename, we should check context consistent of encryption between new_dir and old_inode or old_dir and new_inode. Otherwise inheritance of parent's encryption context will be broken. Signed-off-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: sync with ext4 approach] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:05 -07:00
Chao Yu	1237702471	f2fs: avoid duplicated code by reusing f2fs_read_end_io This patch tries to clean up code because part code of f2fs_read_end_io and mpage_end_io are the same, so it's better to merge and reuse them. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:04 -07:00
Jaegeuk Kim	26bf3dc7e2	f2fs crypto: use per-inode tfm structure This patch applies the following ext4 patch: ext4 crypto: use per-inode tfm structure As suggested by Herbert Xu, we shouldn't allocate a new tfm each time we read or write a page. Instead we can use a single tfm hanging off the inode's crypt_info structure for all of our encryption needs for that inode, since the tfm can be used by multiple crypto requests in parallel. Also use cmpxchg() to avoid races that could result in crypt_info structure getting doubly allocated or doubly freed. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:04 -07:00
hujianyang	da554e48ca	f2fs: recovering broken superblock during mount This patch recovers a broken superblock with the other valid one. Signed-off-by: hujianyang <hujianyang@huawei.com> [Jaegeuk Kim: reinitialize local variables in f2fs_fill_super for retrial] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:03 -07:00
Jaegeuk Kim	304eecc346	f2fs crypto: check encryption for tmpfile This patch adds to check encryption for tmpfile in early stage. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:02 -07:00
Chao Yu	7e01e7ad74	f2fs: support RENAME_WHITEOUT As the description of rename in manual, RENAME_WHITEOUT is a special operation that only makes sense for overlay/union type filesystem. When performing rename with RENAME_WHITEOUT, dst will be replace with src, and meanwhile, a 'whiteout' will be create with name of src. A "whiteout" is designed to be a char device with 0,0 device number, it has specially meaning for stackable filesystem. In these filesystems, there are multiple layers exist, and only top of these can be modified. So a whiteout in top layer is used to hide a corresponding file in lower layer, as well removal of whiteout will make the file appear. Now in overlayfs, when we rename a file which is exist in lower layer, it will be copied up to upper if it is not on upper layer yet, and then rename it on upper layer, source file will be whiteouted to hide corresponding file in lower layer at the same time. So in upper layer filesystem, implementation of RENAME_WHITEOUT provide a atomic operation for stackable filesystem to support rename operation. There are multiple ways to implement RENAME_WHITEOUT in log of this commit: `7dcf5c3e45` ("xfs: add RENAME_WHITEOUT support") which pointed out by Dave Chinner. For now, we just try to follow the way that xfs/ext4 use. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:01 -07:00
Chao Yu	381722d2ac	f2fs: introduce update_meta_page Add a help function update_meta_page() to update meta page with specified buffer. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:00 -07:00
Chao Yu	cb5c94cf3a	f2fs crypto: zero next free dnode block Now page cache of meta inode is used by garbage collection for encrypted page, it may contain random data, so we should zero it before issuing discard. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:21:00 -07:00
Jaegeuk Kim	cfc4d971df	f2fs crypto: split f2fs_crypto_init/exit with two parts This patch splits f2fs_crypto_init/exit with two parts: base initialization and memory allocation. Firstly, f2fs module declares the base encryption memory pointers. Then, allocating internal memories is done at the first encrypted inode access. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:59 -07:00
Chao Yu	b9da898b05	f2fs crypto: fix incorrect release for crypto ctx When encryption feature is enable, if we rmmod f2fs module, we will encounter a stack backtrace reported in syslog: "BUG: Bad page state in process rmmod pfn:aaf8a page:f0f4f148 count:0 mapcount:129 mapping:ee2f4104 index:0x80 flags: 0xee2830a4(referenced\|lru\|slab\|private_2\|writeback\|swapbacked\|mlocked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x2030a0(lru\|slab\|private_2\|writeback\|mlocked) Modules linked in: f2fs(O-) fuse bnep rfcomm bluetooth dm_crypt binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device joydev ppdev mac_hid lp hid_generic i2c_piix4 parport_pc psmouse snd serio_raw parport soundcore ext4 jbd2 mbcache usbhid hid e1000 [last unloaded: f2fs] CPU: 1 PID: 3049 Comm: rmmod Tainted: G B O 4.1.0-rc3+ #10 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 00000000 00000000 c0021eb4 c15b7518 f0f4f148 c0021ed8 c112e0b7 c1779174 c9b75674 000aaf8a 01b13ce1 c17791a4 f0f4f148 ee2830a4 c0021ef8 c112e3c3 00000000 f0f4f148 c0021f34 f0f4f148 ee2830a4 ef9f0000 c0021f20 c112fdf8 Call Trace: [<c15b7518>] dump_stack+0x41/0x52 [<c112e0b7>] bad_page.part.72+0xa7/0x100 [<c112e3c3>] free_pages_prepare+0x213/0x220 [<c112fdf8>] free_hot_cold_page+0x28/0x120 [<c1073380>] ? try_to_wake_up+0x2b0/0x2b0 [<c112ff15>] __free_pages+0x25/0x30 [<c112c4fd>] mempool_free_pages+0xd/0x10 [<c112c5f1>] mempool_free+0x31/0x90 [<f0f441cf>] f2fs_exit_crypto+0x6f/0xf0 [f2fs] [<f0f456c4>] exit_f2fs_fs+0x23/0x95f [f2fs] [<c10c30e0>] SyS_delete_module+0x130/0x180 [<c11556d6>] ? vm_munmap+0x46/0x60 [<c15bd888>] sysenter_do_call+0x12/0x12" The reason is that: since commit 0827e645fd35 ("f2fs crypto: shrink size of the f2fs_crypto_ctx structure") is merged, some fields in f2fs_crypto_ctx structure are merged into a union as they will never be used simultaneously in write path, read path or on free list. In f2fs_exit_crypto, we traverse each crypto ctx from free list, in this moment, our free_list field in union is valid, but still we will try to release memory space which is pointed by other invalid field in union structure for each ctx. Then the error occurs, let's fix it with this patch. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:58 -07:00
Chao Yu	7bf4b5576a	f2fs crypto: fix to release buffer for fname crypto This patch fixes memory leak issue in error path of f2fs_fname_setup_filename(). Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:57 -07:00
Jaegeuk Kim	ca40b03052	f2fs crypto: shrink size of the f2fs_crypto_ctx structure This patch integrates the below patch into f2fs. "ext4 crypto: shrink size of the ext4_crypto_ctx structure Some fields are only used when the crypto_ctx is being used on the read path, some are only used on the write path, and some are only used when the structure is on free list. Optimize memory use by using a union." Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:57 -07:00
Jaegeuk Kim	640778fbc9	f2fs crypto: get rid of ci_mode from struct f2fs_crypt_info This patch integrates the below patch into f2fs. "ext4 crypto: get rid of ci_mode from struct ext4_crypt_info The ci_mode field was superfluous, and getting rid of it gets rid of an unused hole in the structure." Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:56 -07:00
Jaegeuk Kim	8bacf6deb0	f2fs crypto: use slab caches This patch integrates the below patch into f2fs. "ext4 crypto: use slab caches Use slab caches the ext4_crypto_ctx and ext4_crypt_info structures for slighly better memory efficiency and debuggability." Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:55 -07:00
Jaegeuk Kim	06e1bc05ca	f2fs: truncate data blocks for orphan inode As Hu reported, F2FS has a space leak problem, when conducting: 1) format a 4GB f2fs partition 2) dd a 3G file, 3) unlink it. So, when doing f2fs_drop_inode(), we need to truncate data blocks before skipping it. We can also drop unused caches assigned to each inode. Reported-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:54 -07:00
Dan Carpenter	912a83b509	f2fs: cleanup a confusing indent The return was not indented far enough so it looked like it was supposed to go with the other if statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:53 -07:00
Arnd Bergmann	7beb428eda	f2fs: fix building on 32-bit architectures A bug fix to the debug output extended the type of some local variables to 64-bit, which now causes the kernel to fail building because of missing 64-bit division functions: ERROR: "__aeabi_uldivmod" [fs/f2fs/f2fs.ko] undefined! In the kernel, we have to use div_u64 or do_div to do this, in order to annotate that this is an expensive operation. As the function is only called for debug out, we know this is not performance critical, so it is safe to use div_u64. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: d1f85bd38db19 ("f2fs: avoid value overflow in showing current status") Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:53 -07:00
Jaegeuk Kim	e19ef527aa	f2fs: avoid buggy functions This patch avoids to use a buggy function for now. It needs to fix them later. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:52 -07:00
hujianyang	08b95126c7	f2fs: add compat_ioctl to provide backward compatability introduce compat_ioctl to regular files, but doesn't add this functionality to f2fs_dir_operations. While running a 32-bit busybox, I met an error like this: (A is a directory) chattr: reading flags on A: Inappropriate ioctl for device This patch copies compat_ioctl from f2fs_file_operations and fix this problem. Signed-off-by: hujianyang <hujianyang@huawei.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:51 -07:00
Jaegeuk Kim	40a02be178	f2fs: do not issue next dnode discard redundantly We have a discard map, so that we can avoid redundant discard issues. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-06-01 16:20:50 -07:00
Olga Kornievskaia	e8d975e73e	fixing infinite OPEN loop in 4.0 stateid recovery Problem: When an operation like WRITE receives a BAD_STATEID, even though recovery code clears the RECLAIM_NOGRACE recovery flag before recovering the open state, because of clearing delegation state for the associated inode, nfs_inode_find_state_and_recover() gets called and it makes the same state with RECLAIM_NOGRACE flag again. As a results, when we restart looking over the open states, we end up in the infinite loop instead of breaking out in the next test of state flags. Solution: unset the RECLAIM_NOGRACE set because of calling of nfs_inode_find_state_and_recover() after returning from calling recover_open() function. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-06-01 09:58:02 -04:00
Vladimir Zapolskiy	eaa5cd9263	fs: sysfs: don't pass count == 0 to bin file readers If count == 0 bytes are requested by a reader, sysfs_kf_bin_read() deliberately returns 0 without passing a potentially harmful value to some externally defined underlying battr->read() function. However in case of (pos == size && count) the next clause always sets count to 0 and this value is handed over to battr->read(). The change intends to make obsolete (and remove later) a redundant sanity check in battr->read(), if it is present, or add more protection to struct bin_attribute users, who does not care about input arguments. Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-06-01 10:17:17 +09:00
Dave Chinner	b9a350a118	Merge branch 'xfs-sparse-inode' into for-next	2015-06-01 10:51:38 +10:00
Dave Chinner	e01c025fbd	Merge branch 'xfs-misc-fixes-for-4.2' into for-next	2015-06-01 10:50:18 +10:00
Nan Jia	339e4f66d1	xfs: Clean up xfs_trans_dup_dqinfo Fixed two missing spaces. Signed-off-by: Nan Jia <jiananmail@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 10:50:00 +10:00
Linus Torvalds	8ba64dc338	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Off-by-one in d_walk()/__dentry_kill() race fix. It's very hard to hit; possible in the same conditions as the original bug, except that you need the skipped branch to contain all the remaining evictables, so that the d_walk()-calling loop in d_invalidate() decides there's nothing more to do and doesn't go for another pass - otherwise that next pass will sweep the sucker. So it's not too urgent, but seeing that the fix is obvious and the original commit has spread into all -stable branches..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_walk() might skip too much	2015-05-31 16:00:34 -07:00
Eric Sandeen	39e56d9219	xfs: don't cast string literals The commit: `a9273ca5` xfs: convert attr to use unsigned names added these (unsigned char ) casts, but then the _SIZE macros return "7" - size of a pointer minus one - not the length of the string. This is harmless in the kernel, because the _SIZE macros are not used, but as we sync up with userspace, this will matter. I don't think the cast is necessary; i.e. assigning the string literal to an unsigned char , or passing it to a function expecting an unsigned char *, should be ok, right? Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:38 +10:00
Brian Foster	7f884dc198	xfs: fix quota block reservation leak when tp allocates and frees blocks Al Viro reports that generic/231 fails frequently on XFS and bisected the problem to the following commit: `5d11fb4b` xfs: rework zero range to prevent invalid i_size updates ... which is just the first commit that happens to cause fsx to reproduce the problem. fsx reproduces via zero range calls. The aforementioned commit overhauls zero range to use hole punch and fallocate. As it turns out, the problem is reproducible on demand using basic hole punch as follows: $ mkfs.xfs -f -m crc=1,finobt=1 <dev> $ mount <dev> /mnt -o uquota $ xfs_io -f -c "falloc 0 50m" /mnt/file $ for i in $(seq 1 20); do xfs_io -c "fpunch ${i}m 32k" /mnt/file; done $ rm -f /mnt/file $ repquota -us /mnt ... User used soft hard grace used soft hard grace ---------------------------------------------------------------------- root -- 32K 0K 0K 3 0 0 A file is allocated with a single 50m extent. The extent count increases via hole punches until the bmap converts to btree format. The file is removed but quota reports 32k of space usage for the user. This reservation is effectively leaked for the lifetime of the mount. The reason this occurs is because the quota block reservation tracking is confused when a transaction happens to free and allocate blocks at the same time. Consider the following sequence of events: - tp is allocated from xfs_free_file_space() and reserves several blocks for btree management. Blocks are reserved against the dquot and marked as such in the transaction (qtrx->qt_blk_res). - 8 blocks are accounted free when the 32k range is punched out. xfs_trans_mod_dquot() is called with XFS_TRANS_DQ_BCOUNT and sets ->qt_bcount_delta to -8. - Subsequently, a block is allocated against the same transaction by xfs_bmap_extents_to_btree() for btree conversion. A call to xfs_trans_mod_dquot() increases qt_blk_res_used to 1 and qt_bcount_delta to -7. - The transaction is dup'd and committed by xfs_bmap_finish(). xfs_trans_dup_dqinfo() sets the first transaction up such that it has a matching qt_blk_res and qt_blk_res_used of 1. The remaining unused reservation is transferred to the duplicate tp. When the transactions are committed, the dquots are fixed up in xfs_trans_apply_dquot_deltas() according to one of two methods: 1.) If the transaction holds a block reservation (->qt_blk_res != 0), _only_ the unused portion reservation is unaccounted from the dquot. Note that the tp duplication behavior of xfs_bmap_finish() makes it such that qt_blk_res is typically 0 for tp's with unused reservation. 2.) Otherwise, the dquot is fixed up based on the block delta (->qt_bcount_delta) created by the transaction. Therefore, if a transaction has a negative qt_bcount_delta and positive qt_blk_res_used, the former set of blocks that have been removed from the file are never factored out of the in-core dquot reservation. Instead, *_apply_dquot_deltas() sees 1 block used out of a 1 block reservation and believes there is nothing to fix up. The on-disk d_bcount is updated independently from qt_bcount_delta, and thus is correct (and allows the quota usage to correct on remount). To deal with this situation, we effectively want the "used reservation" part of the transaction to be consistent with any freed blocks with respect to quota tracking. For example, if 8 blocks are freed, the subsequent single block allocation does not need to consume the initial reservation made by the tp. Instead, it simply borrows one from the previously freed. One possible implementation of such borrowing is to avoid the blks_res_used increment when bcount_delta is negative. This alone is flawed logic in that it only handles the case where blocks are freed before allocated, however. Rather than add more complexity to manage synchronization between bcount_delta and blks_res_used, kill the latter entirely. blk_res_used is only updated in one place and always in sync with delta_bcount. Therefore, the net block reservation consumption of the transaction is always available from bcount_delta. Calculate the reservation consumption on the fly where necessary based on whether the tp has a reservation and results in a positive net block delta on the inode. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:37 +10:00
Brian Foster	2e588a46aa	xfs: always log the inode on unwritten extent conversion The fsync() requirements for crash consistency on XFS are to flush file data and force any in-core inode updates to the log. We currently check whether the inode is pinned to identify whether the log needs to be forced, since a non-zero pin count generally represents an inode that has transactions awaiting a flush to the on-disk log. This is not sufficient in all cases, however. Reports of xfstests test generic/311 failures on ppc64/s390x hosts have identified failures to fsync outstanding inode modifications due to the inode not being pinned at the time of the fsync. This occurs because certain bmap updates can complete by logging bmapbt buffers but without ever dirtying (and thus pinning) the core inode. The following is a specific incarnation of this problem: $ mount $dev /mnt -o noatime,nobarrier $ for i in $(seq 0 2 31); do \ xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \ done $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \ hexdump /mnt/file; \ ./xfstests-dev/src/godown /mnt ... 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0014000 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 $ umount /mnt; mount ... $ hexdump /mnt/file 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 In short, the unwritten extent conversion for the last write is lost despite the fact that an fsync executed before the filesystem was shutdown. Note that this is impossible to reproduce on v5 supers due to unconditional time callbacks for di_changecount and highly difficult to reproduce on CONFIG_HZ=1000 kernels due to those same callbacks frequently updating cmtime prior to the bmap update. CONFIG_HZ=100 reduces timer granularity enough to increase the odds that time updates are skipped and allows this to reproduce within a handful of attempts. To deal with this problem, unconditionally log the core in the unwritten extent conversion path. Fix up logflags after the extent conversion to keep the extent update code consistent with the other extent update helpers. This fixup is not necessary for the other (hole, delay) extent helpers because they execute in the block allocation codepath, which already logs the inode for other reasons (e.g., for di_nblocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-06-01 07:15:23 +10:00
Chao Yu	e298e73bd7	ext4 crypto: release crypto resource on module exit Crypto resource should be released when ext4 module exits, otherwise it will cause memory leak. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:37:35 -04:00
Theodore Ts'o	abdd438b26	ext4 crypto: handle unexpected lack of encryption keys Fix up attempts by users to try to write to a file when they don't have access to the encryption key. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:39 -04:00
Theodore Ts'o	4d3c4e5b8c	ext4 crypto: allocate the right amount of memory for the on-disk symlink Previously we were taking the required padding when allocating space for the on-disk symlink. This caused a buffer overrun which could trigger a krenel crash when running fsstress. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:32 -04:00
Theodore Ts'o	82d0d3e7e6	ext4 crypto: clean up error handling in ext4_fname_setup_filename Fix a potential memory leak where fname->crypto_buf.name wouldn't get freed in some error paths, and also make the error handling easier to understand/audit. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:22 -04:00
Theodore Ts'o	d87f6d78e9	ext4 crypto: policies may only be set on directories Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out we were missing this check. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:14 -04:00
Theodore Ts'o	c2faccaff6	ext4 crypto: enforce crypto policy restrictions on cross-renames Thanks to Chao Yu <chao2.yu@samsung.com> for pointing out the need for this check. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:09 -04:00
Theodore Ts'o	e709e9df64	ext4 crypto: encrypt tmpfile located in encryption protected directory Factor out calls to ext4_inherit_context() and move them to __ext4_new_inode(); this fixes a problem where ext4_tmpfile() wasn't calling calling ext4_inherit_context(), so the temporary file wasn't getting protected. Since the blocks for the tmpfile could end up on disk, they really should be protected if the tmpfile is created within the context of an encrypted directory. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:35:02 -04:00
Theodore Ts'o	6bc445e0ff	ext4 crypto: make sure the encryption info is initialized on opendir(2) Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:34:57 -04:00
Theodore Ts'o	5555702955	ext4 crypto: set up encryption info for new inodes in ext4_inherit_context() Set up the encryption information for newly created inodes immediately after they inherit their encryption context from their parent directories. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:34:29 -04:00
Theodore Ts'o	95ea68b4c7	ext4 crypto: fix memory leaks in ext4_encrypted_zeroout ext4_encrypted_zeroout() could end up leaking a bio and bounce page. Fortunately it's not used much. While we're fixing things up, refactor out common code into the static function alloc_bounce_page() and fix up error handling if mempool_alloc() fails. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:34:24 -04:00
Theodore Ts'o	c936e1ec28	ext4 crypto: use per-inode tfm structure As suggested by Herbert Xu, we shouldn't allocate a new tfm each time we read or write a page. Instead we can use a single tfm hanging off the inode's crypt_info structure for all of our encryption needs for that inode, since the tfm can be used by multiple crypto requests in parallel. Also use cmpxchg() to avoid races that could result in crypt_info structure getting doubly allocated or doubly freed. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:34:22 -04:00
Theodore Ts'o	71dea01ea2	ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled On arm64 this is apparently needed for CTS mode to function correctly. Otherwise attempts to use CTS return ENOENT. Change-Id: I732ea9a5157acc76de5b89edec195d0365f4ca63 Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:31:37 -04:00
Theodore Ts'o	614def7013	ext4 crypto: shrink size of the ext4_crypto_ctx structure Some fields are only used when the crypto_ctx is being used on the read path, some are only used on the write path, and some are only used when the structure is on free list. Optimize memory use by using a union. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-05-31 13:31:34 -04:00
Richard Weinberger	f74a14e870	um: Remove hppfs hppfs (honeypot procfs) was an attempt to use UML as honeypot. It was never stable nor in heavy use. As Al Viro and Christoph Hellwig pointed some major issues out it is better to let it die. Signed-off-by: Richard Weinberger <richard@nod.at>	2015-05-31 13:23:08 +02:00
Linus Torvalds	1be44e234b	xfs: update for 4.1-rc6 Changes in this update: o regression fix for new rename whiteout code o regression fixes for new superblock generic per-cpu counter code o fix for incorrect error return sign introduced in 3.17 o metadata corruption fixes that need to go back to -stable kernels -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJVaO0JAAoJEK3oKUf0dfodd4UQANRdfXnUrpyQGhVS7HFoFoVt FIQ52pPGbMu72+DqHc+Q41uvgAPe65LFB2VUL6CUGCMExstF72F5+QonzppMgkMo unPER3eB8ya03SY+Kp+803ZGgzI2Nl2M6w8Kof730/RUk56PTGYIx4eLXd6iZSli RsYjw8JDbeue5OQo5FPmLCSQ/Kr5ZJXbgWVPyWkKg9aCcXLN5YSJIV3xcMTK9Q2I LqqODkyatnGc6YxGAKddS7Xzt1ntlZgbe5mndQw04a2g0Lf6emPH5r8b0UJXIu96 advOBX0pEbad4FeFS6Mo5D+nNCaaNP4WzN7wgdb+BYNVw3ss4Ebam7+yY6Gexg6y bzZOEkk9saL4YeBDgyYICNu7kG4BRVKRQiiX220G6SFXM3nqbl7qBPb3kVFyDpcI RRuFJ0ZV0kFJ+3IQ4xVnIh6nootceRk/mvZaK5HhLhQLzklpZ8fj4HF3oBDUAnvN wNd+7GoZy7zldjCkbF4BP3GjUeW+b9ngrCNc+bFXi5cUbdECXAa2krjxyY+MlQF2 veNVVcsoRdfeM0VjJh2/piGJxMWIlRqXdKzPKsfMWnlIaJ6YyslfbSq+2K7LxgGR Ho3Sjt0oUuPMZ9F/Mjj+XDqwmzgooUHXNyDBxhGXBNBPjApcRLcb2vQ2SrWEmeGJ vZmC2R1ZoGdBJg8a55BT =w5SP -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This is a little larger than I'd like late in the release cycle, but all the fixes are for regressions introduced in the 4.1-rc1 merge, or are needed back in -stable kernels fairly quickly as they are filesystem corruption or userspace visible correctness issues. Changes in this update: - regression fix for new rename whiteout code - regression fixes for new superblock generic per-cpu counter code - fix for incorrect error return sign introduced in 3.17 - metadata corruption fixes that need to go back to -stable kernels" * tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: fix broken i_nlink accounting for whiteout tmpfile inode xfs: xfs_iozero can return positive errno xfs: xfs_attr_inactive leaves inconsistent attr fork state behind xfs: extent size hints can round up extents past MAXEXTLEN xfs: inode and free block counters need to use __percpu_counter_compare percpu_counter: batch size aware __percpu_counter_compare() xfs: use percpu_counter_read_positive for mp->m_icount	2015-05-29 16:45:45 -07:00
Andreas Gruenbacher	2f6b3879c2	nfsd: Remove dead declarations Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-05-29 11:04:04 -04:00
Arnd Bergmann	6ac75368e1	nfsd: work around a gcc-5.1 warning gcc-5.0 warns about a potential uninitialized variable use in nfsd: fs/nfsd/nfs4state.c: In function 'nfsd4_process_open2': fs/nfsd/nfs4state.c:3781:3: warning: 'old_deny_bmap' may be used uninitialized in this function [-Wmaybe-uninitialized] reset_union_bmap_deny(old_deny_bmap, stp); ^ fs/nfsd/nfs4state.c:3760:16: note: 'old_deny_bmap' was declared here unsigned char old_deny_bmap; ^ This is a false positive, the code path that is warned about cannot actually be reached. This adds an initialization for the variable to make the warning go away. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-05-29 11:04:03 -04:00
Andreas Gruenbacher	0c9d65e76a	nfsd: Checking for acl support does not require fetching any acls Whether or not a file system supports acls can be determined with IS_POSIXACL(inode) and does not require trying to fetch any acls; the code for computing the supported_attrs and aclsupport attributes can be simplified. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-05-29 11:04:02 -04:00
Andreas Gruenbacher	cc265089ce	nfsd: Disable NFSv2 timestamp workaround for NFSv3+ NFSv2 can set the atime and/or mtime of a file to specific timestamps but not to the server's current time. To implement the equivalent of utimes("file", NULL), it uses a heuristic. NFSv3 and later do support setting the atime and/or mtime to the server's current time directly. The NFSv2 heuristic is still enabled, and causes timestamps to be set wrong sometimes. Fix this by moving the heuristic into the NFSv2 specific code. We can leave it out of the create code path: the owner can always set timestamps arbitrarily, and the workaround would never trigger. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-05-29 11:04:01 -04:00
Al Viro	2159184ea0	d_walk() might skip too much when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Cc: stable@vger.kernel.org # 3.2 and later Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-05-28 23:45:30 -04:00
Bob Copeland	5a6b2b36a8	omfs: fix potential integer overflow in allocator Both 'i' and 'bits_per_entry' are signed integers but the result is a u64 block number. Cast i to u64 to avoid truncation on 32-bit targets. Found by Coverity (CID 200679). Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-05-28 18:25:19 -07:00
Bob Copeland	c0345ee57d	omfs: fix sign confusion for bitmap loop counter The count variable is used to iterate down to (below) zero from the size of the bitmap and handle the one-filling the remainder of the last partial bitmap block. The loop conditional expects count to be signed in order to detect when the final block is processed, after which count goes negative. Unfortunately, a recent change made this unsigned along with some other related fields. The result of is this is that during mount, omfs_get_imap will overrun the bitmap array and corrupt memory unless number of blocks happens to be a multiple of 8 * blocksize. Fix by changing count back to signed: it is guaranteed to fit in an s32 without overflow due to an enforced limit on the number of blocks in the filesystem. Signed-off-by: Bob Copeland <me@bobcopeland.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-05-28 18:25:19 -07:00
Bob Copeland	3a281f9466	omfs: set error return when d_make_root() fails A static checker found the following issue in the error path for omfs_fill_super: fs/omfs/inode.c:552 omfs_fill_super() warn: missing error code here? 'd_make_root()' failed. 'ret' = '0' Fix by returning -ENOMEM in this case. Signed-off-by: Bob Copeland <me@bobcopeland.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-05-28 18:25:18 -07:00
Sasha Levin	dcbff39da3	fs, omfs: add NULL terminator in the end up the token list match_token() expects a NULL terminator at the end of the token list so that it would know where to stop. Not having one causes it to overrun to invalid memory. In practice, passing a mount option that omfs didn't recognize would sometimes panic the system. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Bob Copeland <me@bobcopeland.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-05-28 18:25:18 -07:00
Andrew Morton	2b1d3ae940	fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings load_elf_binary() returns `retval', not `error'. Fixes: `a87938b2e2` ("fs/binfmt_elf.c: fix bug in loading of PIE binaries") Reported-by: James Hogan <james.hogan@imgtec.com> Cc: Michael Davidson <md@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-05-28 18:25:18 -07:00
Brian Foster	22ce1e1472	xfs: enable sparse inode chunks for v5 superblocks Enable mounting of filesystems with sparse inode support enabled. Add the incompat. feature bit to the *_ALL mask. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:26:33 +10:00
Brian Foster	09b5660413	xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() xfs_ifree_cluster() is called to mark all in-memory inodes and inode buffers as stale. This occurs after we've removed the inobt records and dropped any references of inobt data. xfs_ifree_cluster() uses the starting inode number to walk the namespace of inodes expected for a single chunk a cluster buffer at a time. The cluster buffer disk addresses are calculated by decoding the sequential inode numbers expected from the chunk. The problem with this approach is that if the inode chunk being removed is a sparse chunk, not all of the buffer addresses that are calculated as part of this sequence may be inode clusters. Attempting to acquire the buffer based on expected inode characterstics (i.e., cluster length) can lead to errors and is generally incorrect. We already use a couple variables to carry requisite state from xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a new internal structure to carry the existing parameters through these functions. Add an alloc field that represents the physical allocation bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster() to check each inode against the bitmap and skip the clusters that were never allocated as real inodes on disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:26:03 +10:00
Brian Foster	10ae3dc7f2	xfs: only free allocated regions of inode chunks An inode chunk is currently added to the transaction free list based on a simple fsb conversion and hardcoded chunk length. The nature of sparse chunks is such that the physical chunk of inodes on disk may consist of one or more discontiguous parts. Blocks that reside in the holes of the inode chunk are not inodes and could be allocated to any other use or not allocated at all. Refactor the existing xfs_bmap_add_free() call into the xfs_difree_inode_chunk() helper. The new helper uses the existing calculation if a chunk is not sparse. Otherwise, use the inobt record holemask to free the contiguous regions of the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:22:52 +10:00
Brian Foster	26dd5217de	xfs: filter out sparse regions from individual inode allocation Inode allocation from an existing record with free inodes traditionally selects the first inode available according to the ir_free mask. With sparse inode chunks, the ir_free mask could refer to an unallocated region. We must mask the unallocated regions out of ir_free before using it to select a free inode in the chunk. Update the xfs_inobt_first_free_inode() helper to find the first free inode available of the allocated regions of the inode chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:20:10 +10:00
Brian Foster	1cdadee11f	xfs: randomly do sparse inode allocations in DEBUG mode Sparse inode allocations generally only occur when full inode chunk allocation fails. This requires some level of filesystem space usage and fragmentation. For filesystems formatted with sparse inode chunks enabled, do random sparse inode chunk allocs when compiled in DEBUG mode to increase test coverage. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:19:29 +10:00
Brian Foster	56d1115c9b	xfs: allocate sparse inode chunks on full chunk allocation failure xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode chunk. If all else fails, reduce the allocation to the sparse length and alignment and attempt to allocate a sparse inode chunk. If sparse chunk allocation succeeds, check whether an inobt record already exists that can track the chunk. If so, inherit and update the existing record. Otherwise, insert a new record for the sparse chunk. Create helpers to align sparse chunk inode records and insert or update existing records in the inode btrees. The xfs_inobt_insert_sprec() helper implements the merge or update semantics required for sparse inode records with respect to both the inobt and finobt. To update the inobt, either insert a new record or merge with an existing record. To update the finobt, use the updated inobt record to either insert or replace an existing record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:18:32 +10:00
Brian Foster	4148c347a4	xfs: helper to convert holemask to inode alloc. bitmap The inobt record holemask field is a condensed data type designed to fit into the existing on-disk record and is zero based (allocated regions are set to 0, sparse regions are set to 1) to provide backwards compatibility. This makes the type somewhat complex for use in higher level inode manipulations such as individual inode allocation, etc. Rather than foist the complexity of dealing with this field to every bit of logic that requires inode granular information, create a helper to convert the holemask to an inode allocation bitmap. The inode allocation bitmap is inode granularity similar to the inobt record free mask and indicates which inodes of the chunk are physically allocated on disk, irrespective of whether the inode is considered allocated or free by the filesystem. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:09:05 +10:00
Brian Foster	7f43c907ad	xfs: handle sparse inode chunks in icreate log recovery Recovery of icreate transactions assumes hardcoded values for the inode count and chunk length. Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update the icreate validity checks to allow for appropriately sized inode chunks and verify the inode count matches what is expected based on the extent length rather than assuming a hardcoded count. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:06:30 +10:00
Brian Foster	463958af5c	xfs: pass inode count through ordered icreate log item v5 superblocks use an ordered log item for logging the initialization of inode chunks. The icreate log item is currently hardcoded to an inode count of 64 inodes. The agbno and extent length are used to initialize the inode chunk from log recovery. While an incorrect inode count does not lead to bad inode chunk initialization, we should pass the correct inode count such that log recovery has enough data to perform meaningful validity checks on the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:05:49 +10:00
Brian Foster	12d0714d4b	xfs: use actual inode count for sparse records in bulkstat/inumbers The bulkstat and inumbers mechanisms make the assumption that inode records consist of a full 64 inode chunk in several places. For example, this is used to track how many inodes have been processed overall as well as to determine whether a record has allocated inodes that must be handled. This assumption is invalid for sparse inode records. While sparse inodes will be marked as free in the ir_free mask, they are not accounted as free in ir_freecount because they cannot be allocated. Therefore, ir_freecount may be less than 64 inodes in an inode record for which all physically allocated inodes are free (and in turn ir_freecount < 64 does not signify that the record has allocated inodes). The new in-core inobt record format includes the ir_count field. This holds the number of true, physical inodes tracked by the record. The in-core ir_count field is always valid as it is hardcoded to XFS_INODES_PER_CHUNK when sparse inodes is not enabled. Use ir_count to handle inode records correctly in bulkstat in a generic manner. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:04:19 +10:00
Brian Foster	5419040fc0	xfs: introduce inode record hole mask for sparse inode chunks The inode btrees track 64 inodes per record regardless of inode size. Thus, inode chunks on disk vary in size depending on the size of the inodes. This creates a contiguous allocation requirement for new inode chunks that can be difficult to satisfy on an aged and fragmented (free space) filesystems. The inode record freecount currently uses 4 bytes on disk to track the free inode count. With a maximum freecount value of 64, only one byte is required. Convert the freecount field to a single byte and use two of the remaining 3 higher order bytes left for the hole mask field. Use the final leftover byte for the total count field. The hole mask field tracks holes in the chunks of physical space that the inode record refers to. This facilitates the sparse allocation of inode chunks when contiguous chunks are not available and allows the inode btrees to identify what portions of the chunk contain valid inodes. The total count field contains the total number of valid inodes referred to by the record. This can also be deduced from the hole mask. The count field provides clarity and redundancy for internal record verification. Note that neither of the new fields can be written to disk on fs' without sparse inode support. Doing so writes to the high-order bytes of freecount and causes corruption from the perspective of older kernels. The on-disk inobt record data structure is updated with a union to distinguish between the original, "full" format and the new, "sparse" format. The conversion routines to get, insert and update records are updated to translate to and from the on-disk record accordingly such that freecount remains a 4-byte value on non-supported fs, yet the new fields of the in-core record are always valid with respect to the record. This means that higher level code can refer to the current in-core record format unconditionally and lower level code ensures that records are translated to/from disk according to the capabilities of the fs. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 09:03:04 +10:00
Brian Foster	502a4e72b8	xfs: add fs geometry bit for sparse inode chunks Define an fs geometry bit for sparse inode chunks such that the characteristic of the fs can be identified by userspace. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:58:32 +10:00
Brian Foster	e5376fc15a	xfs: sparse inode chunks feature helpers and mount requirements The sparse inode chunks feature uses the helper function to enable the allocation of sparse inode chunks. The incompatible feature bit is set on disk at mkfs time to prevent mount from unsupported kernels. Also, enforce the inode alignment requirements required for sparse inode chunks at mount time. When enabled, full inode chunks (and all inode record) alignment is increased from cluster size to inode chunk size. Sparse inode alignment must match the cluster size of the fs. Both superblock alignment fields are set as such by mkfs when sparse inode support is enabled. Finally, warn that sparse inode chunks is an experimental feature until further notice. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:57:27 +10:00
Brian Foster	066a18845f	xfs: use sparse chunk alignment for min. inode allocation requirement xfs_ialloc_ag_select() iterates through the allocation groups looking for free inodes or free space to determine whether to allow an inode allocation to proceed. If no free inodes are available, it assumes that an AG must have an extent longer than mp->m_ialloc_blks. Sparse inode chunk support currently allows for allocations smaller than the traditional inode chunk size specified in m_ialloc_blks. The current minimum sparse allocation is set in the superblock sb_spino_align field at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use this to represent the minimum supported allocation size for inode chunks. Initialize m_ialloc_min_blks at mount time based on whether sparse inodes are supported. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:55:20 +10:00
Brian Foster	fb4f2b4e5a	xfs: add sparse inode chunk alignment superblock field Add sb_spino_align to the superblock to specify sparse inode chunk alignment. This also currently represents the minimum allowable sparse chunk allocation size. Signed-off-by: Brian Foster <bfoster@redhat.com>	2015-05-29 08:54:03 +10:00
Brian Foster	bfe46d4eb9	xfs: support min/max agbno args in block allocator The block allocator supports various arguments to tweak block allocation behavior and set allocation requirements. The sparse inode chunk feature introduces a new requirement not supported by the current arguments. Sparse inode allocations must convert or merge into an inode record that describes a fixed length chunk (64 inodes x inodesize). Full inode chunk allocations by definition always result in valid inode records. Sparse chunk allocations are smaller and the associated records can refer to blocks not owned by the inode chunk. This model can result in invalid inode records in certain cases. For example, if a sparse allocation occurs near the start of an AG, the aligned inode record for that chunk might refer to agbno 0. If an allocation occurs towards the end of the AG and the AG size is not aligned, the inode record could refer to blocks beyond the end of the AG. While neither of these scenarios directly result in corruption, they both insert invalid inode records and at minimum cause repair to complain, are unlikely to merge into full chunks over time and set land mines for other areas of code. To guarantee sparse inode chunk allocation creates valid inode records, support the ability to specify an agbno range limit for XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are specified in the allocation arguments and limit the block allocation algorithms to that range. The starting 'agbno' hint is clamped to the range if the specified agbno is out of range. If no sufficient extent is available within the range, the allocation fails. For backwards compatibility, the min/max fields can be initialized to 0 to disable range limiting (e.g., equivalent to min=0,max=agsize). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:53:00 +10:00
Brian Foster	999633d304	xfs: update free inode record logic to support sparse inode records xfs_difree_inobt() uses logic in a couple places that assume inobt records refer to fully allocated chunks. Specifically, the use of mp->m_ialloc_inos can cause problems for inode chunks that are sparsely allocated. Sparse inode chunks can, by definition, define a smaller number of inodes than a full inode chunk. Fix the logic that determines whether an inode record should be removed from the inobt to use the ir_free mask rather than ir_freecount. Fix the agi counters modification to use ir_freecount to add the actual number of inodes freed rather than assuming a full inode chunk. Also make sure that we preserve the behavior to not remove inode chunks if the block size is large enough for multiple inode chunks (e.g., bsize=64k, isize=512). This behavior was previously implicit in that in such configurations, ir.freecount of a single record never matches m_ialloc_inos. Hence, add some comments as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:51:37 +10:00
Brian Foster	d4cc540b08	xfs: create individual inode alloc. helper Inode allocation from sparse inode records must filter the ir_free mask against ir_holemask. In preparation for this requirement, create a helper to allocate an individual inode from an inode record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:50:21 +10:00
Yunlei He	498c5e9fcd	f2fs: add default mount options to remount I use f2fs filesystem with /data partition on my Android phone by the default mount options. When I remount /data in order to adding discard option to run some benchmarks, I find the default options such as background_gc, user_xattr and acl turned off. So I introduce a function named default_options in super.c. It do some default setting, and both mount and remount operations will call this function to complete default setting. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:58 -07:00
Jaegeuk Kim	d690358b2b	f2fs crypto: remove checking key context during lookup No matter what the key is valid or not, readdir shows the dir entries correctly. So, lookup should not failed. But, we expect further accesses should be denied from open, rename, link, and so on. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:57 -07:00
Jaegeuk Kim	edf3fb8e9e	f2fs crypto: fix missing key when reading a page 1. mount $mnt 2. cp data $mnt/ 3. umount $mnt 4. log out 5. log in 6. cat $mnt/data -> panic, due to no i_crypt_info. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:56 -07:00
Jaegeuk Kim	cbaf042a3c	f2fs crypto: add symlink encryption This patch implements encryption support for symlink. Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:55 -07:00
Jaegeuk Kim	e7d5545285	f2fs crypto: add filename encryption for roll-forward recovery This patch adds a bit flag to indicate whether or not i_name in the inode is encrypted. If this name is encrypted, we can't do recover_dentry during roll-forward. So, f2fs_sync_file() needs to do checkpoint, if this will be needed in future. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:55 -07:00
Jaegeuk Kim	6e22c691ba	f2fs crypto: add filename encryption for f2fs_lookup This patch implements filename encryption support for f2fs_lookup. Note that, f2fs_find_entry should be outside of f2fs_(un)lock_op(). Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:54 -07:00
Jaegeuk Kim	d8c6822a05	f2fs crypto: add filename encryption for f2fs_readdir This patch implements filename encryption support for f2fs_readdir. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:53 -07:00
Jaegeuk Kim	9ea97163c6	f2fs crypto: add filename encryption for f2fs_add_link This patch adds filename encryption support for f2fs_add_link. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:52 -07:00
Jaegeuk Kim	4375a33664	f2fs crypto: add encryption support in read/write paths This patch adds encryption support in read and write paths. Note that, in f2fs, we need to consider cleaning operation. In cleaning procedure, we must avoid encrypting and decrypting written blocks. So, this patch implements move_encrypted_block(). Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:52 -07:00
Jaegeuk Kim	fcc85a4d86	f2fs crypto: activate encryption support for fs APIs This patch activates the following APIs for encryption support. The rules quoted by ext4 are: - An unencrypted directory may contain encrypted or unencrypted files or directories. - All files or directories in a directory must be protected using the same key as their containing directory. - Encrypted inode for regular file should not have inline_data. - Encrypted symlink and directory may have inline_data and inline_dentry. This patch activates the following APIs. 1. f2fs_link : validate context 2. f2fs_lookup : '' 3. f2fs_rename : '' 4. f2fs_create/f2fs_mkdir : inherit its dir's context 5. f2fs_direct_IO : do buffered io for regular files 6. f2fs_open : check encryption info 7. f2fs_file_mmap : '' 8. f2fs_setattr : '' 9. f2fs_file_write_iter : '' (Called by sys_io_submit) 10. f2fs_fallocate : do not support fcollapse 11. f2fs_evict_inode : free_encryption_info Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:51 -07:00
Jaegeuk Kim	6b3bd08f93	f2fs crypto: filename encryption facilities This patch adds filename encryption infra. Most of codes are copied from ext4 part, but changed to adjust f2fs directory structure. Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:50 -07:00
Jaegeuk Kim	0adda907f2	f2fs crypto: add encryption key management facilities This patch copies from encrypt_key.c in ext4, and modifies for f2fs. Use GFP_NOFS, since _f2fs_get_encryption_info is called under f2fs_lock_op. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:49 -07:00
Jaegeuk Kim	57e5055b0a	f2fs crypto: add f2fs encryption facilities Most of parts were copied from ext4, except: - add f2fs_restore_and_release_control_page which returns control page and restore control page - remove ext4_encrypted_zeroout() - remove sbi->s_file_encryption_mode & sbi->s_dir_encryption_mode - add f2fs_end_io_crypto_work for mpage_end_io Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:49 -07:00
Jaegeuk Kim	f424f664f0	f2fs crypto: add encryption policy and password salt support This patch adds encryption policy and password salt support through ioctl implementation. It adds three ioctls: F2FS_IOC_SET_ENCRYPTION_POLICY, F2FS_IOC_GET_ENCRYPTION_POLICY, F2FS_IOC_GET_ENCRYPTION_PWSALT, which use xattr operations. Note that, these definition and codes are taken from ext4 crypto support. For f2fs, xattr operations and on-disk flags for superblock and inode were changed. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:48 -07:00
Jaegeuk Kim	b93531dd7b	f2fs crypto: add encryption xattr support This patch add some definition for enrcyption xattr. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:47 -07:00
Jaegeuk Kim	d33793fb89	f2fs crypto: add f2fs encryption Kconfig This patch adds f2fs encryption config. This patch integrates: "ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled On arm64 this is apparently needed for CTS mode to function correctly. Otherwise attempts to use CTS return ENOENT." Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:46 -07:00
Jaegeuk Kim	cde4de1205	f2fs crypto: declare some definitions for f2fs encryption feature This definitions will be used by inode and superblock for encyption. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:45 -07:00
Jaegeuk Kim	7f63eb77af	f2fs: report unwritten area in f2fs_fiemap This patch slightly changes f2fs_fiemap function to report unwritten area. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:45 -07:00
Jaegeuk Kim	3589a9190b	f2fs: avoid value overflow in showing current status This patch fixes overflow when do cat /sys/kernel/debug/f2fs/status. If a section is relatively large, dist value can be overflowed. Reported-by: Yossi Goldfill <ygoldfill@radianmemory.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:44 -07:00
Chao Yu	75cd4e098d	f2fs: support FALLOC_FL_ZERO_RANGE Now, FALLOC_FL_ZERO_RANGE flag in ->fallocate is supported in ext4/xfs. In commit, the semantics of this flag is descripted as following:" 1) Make sure that both offset and len are block size aligned. 2) Update the i_size of inode by len bytes. 3) Compute the file's logical block number against offset. If the computed block number is not the starting block of the extent, split the extent such that the block number is the starting block of the extent. 4) Shift all the extents which are lying between [offset, last allocated extent] towards right by len bytes. This step will make a hole of len bytes at offset." This patch implements fallocate's FALLOC_FL_ZERO_RANGE for f2fs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:43 -07:00
Chao Yu	b4ace33703	f2fs: support FALLOC_FL_COLLAPSE_RANGE Now, FALLOC_FL_COLLAPSE_RANGE flag in ->fallocate is supported in ext4/xfs. In commit, the semantics of this flag is descripted as following:" 1) It collapses the range lying between offset and length by removing any data blocks which are present in this range and than updates all the logical offsets of extents beyond "offset + len" to nullify the hole created by removing blocks. In short, it does not leave a hole. 2) It should be used exclusively. No other fallocate flag in combination. 3) Offset and length supplied to fallocate should be fs block size aligned in case of xfs and ext4. 4) Collaspe range does not work beyond i_size." This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for f2fs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:42 -07:00
Chao Yu	19f106bc03	f2fs: introduce f2fs_replace_block() for reuse Introduce a generic function replace_block base on recover_data_page, and export it. So with it we can operate file's meta data which is in CP/SSA area when we invoke fallocate with FALLOC_FL_COLLAPSE_RANGE flag. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:42 -07:00
Chao Yu	d5b692b786	f2fs: do not re-lookup nat cache with same nid In set_node_addr, we try to lookup cached nat entry of inode and then set flag in it. But previously in this function, we have already grabbed nat entry with current node id, if the node id is the same as the one of inode, we do not need to lookup it in cache again. So this patch adds condition judgment for reducing unneeded lookup. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:41 -07:00
Chao Yu	402f721d2f	f2fs: remove unneeded f2fs_make_empty declaration Remove f2fs_make_empty() declaration, since the main body of this function is move into do_make_empty_dir() and the function is obsolete now. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:40 -07:00
Jaegeuk Kim	836b5a6356	f2fs: issue discard with finally produced len and minlen This patch determines to issue discard commands by comparing given minlen and the length of produced final candidates. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:39 -07:00
Jaegeuk Kim	a66cdd9855	f2fs: introduce discard_map for f2fs_trim_fs This patch adds a bitmap for discard issues from f2fs_trim_fs. There-in rule is to issue discard commands only for invalidated blocks after mount. Once mount is done, f2fs_trim_fs trims out whole invalid area. After ehn, it will not issue and discrads redundantly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:39 -07:00
Jaegeuk Kim	d6c67a4fee	f2fs: revmove spin_lock for write_orphan_inodes This patch removes spin_lock, since this is covered by f2fs_lock_op already. And, we should avoid to use page operations inside spin_lock. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:38 -07:00
Jaegeuk Kim	43f3eae1d3	f2fs: split find_data_page according to specific purposes This patch splits find_data_page as follows. 1. f2fs_gc - use get_read_data_page() with read only 2. find_in_level - use find_data_page without locked page 3. truncate_partial_page - In the case cache_only mode, just drop cached page. - Ohterwise, use get_lock_data_page() and guarantee to truncate Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:37 -07:00
Jaegeuk Kim	2fb2c95496	f2fs: fix counting the number of inline_data inodes This patch fixes to count the missing symlink case. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:36 -07:00
Jaegeuk Kim	2dcf51ab2f	f2fs: add need_dentry_mark This patch introduces need_dentry_mark() to clean up and avoid redundant node locks. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:36 -07:00
Jaegeuk Kim	01f28610a1	f2fs: fix race on allocating and deallocating a dentry block There are two threads: f2fs_delete_entry() get_new_data_page() f2fs_reserve_block() dn.blkaddr = XXX lock_page(dentry_block) truncate_hole() dn.blkaddr = NULL unlock_page(dentry_block) lock_page(dentry_block) fill the block from XXX address add new dentries unlock_page(dentry_block) Later, f2fs_write_data_page() will truncate the dentry_block, since its block address is NULL. The reason for this was due to the wrong lock order. In this case, we should do f2fs_reserve_block() after locking its dentry block. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:35 -07:00
Jaegeuk Kim	eaa693f4dc	f2fs: introduce dot and dotdot name check This patch adds an inline function to check dot and dotdot names. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:34 -07:00
Jaegeuk Kim	c879f90da9	f2fs: move get_page for gc victims This patch moves getting victim page into move_data_page. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:33 -07:00
Jaegeuk Kim	05ca3632e5	f2fs: add sbi and page pointer in f2fs_io_info This patch adds f2fs_sb_info and page pointers in f2fs_io_info structure. With this change, we can reduce a lot of parameters for IO functions. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:32 -07:00
Jaegeuk Kim	01b960e94a	f2fs: add f2fs_may_inline_{data, dentry} This patch adds f2fs_may_inline_data and f2fs_may_inline_dentry. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:32 -07:00
Jaegeuk Kim	06957e8fe6	f2fs: clean up f2fs_lookup This patch cleans up to avoid deep indentation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:31 -07:00
Jaegeuk Kim	f1e8866016	f2fs: expose f2fs_mpage_readpages This patch implements f2fs_mpage_readpages for further optimization on encryption support. The basic code was taken from fs/mpage.c, and changed to be simple by adjusting that block_size is equal to page_size in f2fs. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:30 -07:00
Jaegeuk Kim	26d815ad75	f2fs: introduce f2fs_commit_super This patch introduces f2fs_commit_super to write updated superblock. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:29 -07:00
Jaegeuk Kim	003a3e1d60	f2fs: add f2fs_map_blocks This patch introduces f2fs_map_blocks structure likewise ext4_map_blocks. Now, f2fs uses f2fs_map_blocks when handling get_block. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:29 -07:00
Jaegeuk Kim	76f105a2db	f2fs: add feature facility in superblock This patch introduces a feature in superblock, which will indicate any new features for f2fs. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:28 -07:00
Jaegeuk Kim	b5492af78c	f2fs: move existing definitions into f2fs.h This patch moves some inode-related definitions from node.h to f2fs.h to add new features. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-05-28 15:41:27 -07:00
Brian Foster	22419ac9fe	xfs: fix broken i_nlink accounting for whiteout tmpfile inode XFS uses the internal tmpfile() infrastructure for the whiteout inode used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates the inode, drops di_nlink, adds the inode to the agi unlinked list, calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode, and then finishes the common inode setup (e.g., clear I_NEW and unlock). The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was pulled up out of that function as part of the following commit to resolve a deadlock issue: `330033d6` xfs: fix tmpfile/selinux deadlock and initialize security As a result, callers of xfs_create_tmpfile() are responsible for either calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout tmpfile allocation helper does neither. As a result, the vfs ->i_nlink becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links it back into the source dentry and calls xfs_bumplink(). Update the assert in xfs_rename() to help detect this problem in the future and update xfs_rename_alloc_whiteout() to decrement the link count as part of the manual tmpfile inode setup. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 08:14:55 +10:00
Dave Chinner	cddc116228	xfs: xfs_iozero can return positive errno It was missed when we converted everything in XFs to use negative error numbers, so fix it now. Bug introduced in 3.17 by commit `2451337` ("xfs: global error sign conversion"), and should go back to stable kernels. Thanks to Brian Foster for noticing it. cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:32 +10:00
Dave Chinner	6dfe5a049f	xfs: xfs_attr_inactive leaves inconsistent attr fork state behind xfs_attr_inactive() is supposed to clean up the attribute fork when the inode is being freed. While it removes attribute fork extents, it completely ignores attributes in local format, which means that there can still be active attributes on the inode after xfs_attr_inactive() has run. This leads to problems with concurrent inode writeback - the in-core inode attribute fork is removed without locking on the assumption that nothing will be attempting to access the attribute fork after a call to xfs_attr_inactive() because it isn't supposed to exist on disk any more. To fix this, make xfs_attr_inactive() completely remove all traces of the attribute fork from the inode, regardless of it's state. Further, also remove the in-core attribute fork structure safely so that there is nothing further that needs to be done by callers to clean up the attribute fork. This means we can remove the in-core and on-disk attribute forks atomically. Also, on error simply remove the in-memory attribute fork. There's nothing that can be done with it once we have failed to remove the on-disk attribute fork, so we may as well just blow it away here anyway. cc: <stable@vger.kernel.org> # 3.12 to 4.0 Reported-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:08 +10:00
Dave Chinner	6dea405eee	xfs: extent size hints can round up extents past MAXEXTLEN This results in BMBT corruption, as seen by this test: # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc .... # mount /dev/vdc /mnt/scratch # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo which results in this failure on a debug kernel: XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211 .... Call Trace: [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400 [<ffffffff81ddcc29>] ? down_write+0x29/0x60 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17 The tracepoint that indicates the extent that triggered the assert failure is: xfs_iext_insert: idx 0 offset 0 block 16777224 count 2097152 flag 1 Clearly indicating that the extent length is greater than MAXEXTLEN, which is 2097151. A prior trace point shows the allocation was an exact size match and that a length greater than MAXEXTLEN was asked for: xfs_alloc_size_done: agno 1 agbno 8 minlen 2097152 maxlen 2097152 ^^^^^^^ ^^^^^^^ We don't see this problem with extent size hints through the IO path because we can't do single IOs large enough to trigger MAXEXTLEN allocation. fallocate(), OTOH, is not limited in it's allocation sizes and so needs help here. The issue is that the extent size hint alignment is rounding up the extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking into account extent size hints when calculating the maximum extent length to allocate. xfs_bmapi_reserve_delalloc() is already doing this, but direct extent allocation is not. Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is wrong, and it works only because delayed allocation extents are not limited in size to MAXEXTLEN in the in-core extent tree. hence this calculation does not work for direct allocation, and the delalloc code needs fixing. This may, in fact be the underlying bug that occassionally causes transaction overruns in delayed allocation extent conversion, so now we know it's wrong we should fix it, too. Many thanks to Brian Foster for finding this problem during review of this patch. Hence the fix, after much code reading, is to allow xfs_bmap_extsize_align() to align partial extents when full alignment would extend the alignment past MAXEXTLEN. We can safely do this because all callers have higher layer allocation loops that already handle short allocations, and so will simply run another allocation to cover the remainder of the requested allocation range that we ignored during alignment. The advantage of this approach is that it also removes the need for callers to do anything other than limit their requests to MAXEXTLEN - they don't really need to be aware of extent size hints at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:40:06 +10:00
Dave Chinner	8c1903d308	xfs: inode and free block counters need to use __percpu_counter_compare Because the counters use a custom batch size, the comparison functions need to be aware of that batch size otherwise the comparison does not work correctly. This leads to ASSERT failures on generic/027 like this: XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099 ------------[ cut here ]------------ .... Call Trace: [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0 [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0 [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580 [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260 [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0 [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250 [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200 [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0 [<ffffffff811f3958>] evict+0xb8/0x190 [<ffffffff811f433b>] iput+0x18b/0x1f0 [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320 [<ffffffff811d548a>] ? filp_close+0x5a/0x80 [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40 [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71 This is a regression introduced by commit `501ab32` ("xfs: use generic percpu counters for inode counter"). This patch fixes the same problem for both the inode counter and the free block counter in the superblocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:39:34 +10:00
George Wang	74f9ce1cf2	xfs: use percpu_counter_read_positive for mp->m_icount Function percpu_counter_read just return the current counter, which can be negative. This will cause the checking of "allocated inode counts <= m_maxicount" false positive. Use percpu_counter_read_positive can solve this problem, and be consistent with the purpose to introduce percpu mechanism to xfs. Signed-off-by: George Wang <xuw2015@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-05-29 07:39:34 +10:00
Luis R. Rodriguez	9c27847dda	kernel/params: constify struct kernel_param_ops uses Most code already uses consts for the struct kernel_param_ops, sweep the kernel for the last offending stragglers. Other than include/linux/moduleparam.h and kernel/params.c all other changes were generated with the following Coccinelle SmPL patch. Merge conflicts between trees can be handled with Coccinelle. In the future git could get Coccinelle merge support to deal with patch --> fail --> grammar --> Coccinelle --> new patch conflicts automatically for us on patches where the grammar is available and the patch is of high confidence. Consider this a feature request. Test compiled on x86_64 against: * allnoconfig * allmodconfig * allyesconfig @ const_found @ identifier ops; @@ const struct kernel_param_ops ops = { }; @ const_not_found depends on !const_found @ identifier ops; @@ -struct kernel_param_ops ops = { +const struct kernel_param_ops ops = { }; Generated-by: Coccinelle SmPL Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Junio C Hamano <gitster@pobox.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: cocci@systeme.lip6.fr Cc: linux-kernel@vger.kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2015-05-28 11:32:10 +09:30
Linus Torvalds	de182468d1	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Back from SambaXP - now have 8 small CIFS bug fixes to merge" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Fix race condition on RFC1002_NEGATIVE_SESSION_RESPONSE Fix to convert SURROGATE PAIR cifs: potential missing check for posix_lock_file_wait Fix to check Unique id and FileType when client refer file directly. CIFS: remove an unneeded NULL check [cifs] fix null pointer check Fix that several functions handle incorrect value of mapchars cifs: Don't replace dentries for dfs mounts	2015-05-27 14:09:16 -07:00
Linus Torvalds	3cfd4ba7d3	Merge branch 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull two overlayfs fixes from Miklos Szeredi: "Overlayfs rmdir() failed to check for emptiness in one case; this was introduced in 4.0. The other bug was there since day one: failure to mount if upper fs is full, which bit some OpenWRT folks" * 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: mount read-only if workdir can't be created ovl: don't remove non-empty opaque directory	2015-05-27 09:47:57 -07:00
Anand Jain	2421a8cd5f	Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:22 +02:00
Anand Jain	24bd69cb0f	Btrfs: sysfs: add support to add parent for fsid To support seed sysfs layout and represent seed fsid under the sprout we need the facility to create fsid under the specified parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:22 +02:00
Anand Jain	b7c35e81ad	Btrfs: sysfs: separate kobject and attribute creation Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:22 +02:00
Anand Jain	1d1c1be372	Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:22 +02:00
Anand Jain	ef1a0daadf	Btrfs: sysfs: make btrfs_sysfs_add_device() non static Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:22 +02:00
Anand Jain	0c10e2d482	Btrfs: sysfs: make btrfs_sysfs_add_fsid() non static Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:21 +02:00
Anand Jain	6c14a1641b	Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info since btrfs_kobj_rm_device() does nothing with fs_info Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:21 +02:00
Anand Jain	1ba43816af	Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info btrfs_kobj_add_device() does not need fs_info any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:21 +02:00
Anand Jain	2e3e12815a	Btrfs: sysfs: provide framework to remove all fsid sysfs kobject Just a helper function to clean up the sysfs fsid kobjects. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:21 +02:00
Anand Jain	5a13f4308c	Btrfs: sysfs: add pointer to access fs_info from fs_devices adds fs_info pointer with struct btrfs_fs_devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:21 +02:00
Anand Jain	c73eccf75b	Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	2e7910d6ca	Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices This patch will provide a framework and help to create attributes from the structure btrfs_fs_devices which are available even before fs_info is created. So by moving the parent kobject super_kobj from fs_info to btrfs_fs_devices, it will help to create attributes from the btrfs_fs_devices as well. Patches on top of this patch now will be able to create the sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices when devices are scanned and registered to the kernel. Just to note, this does not change any of the existing btrfs sysfs external kobject names and its attributes and not even the life cycle of them. Changes are internal only. And to ensure the same, this path has been tested with various device operations and, checking and comparing the sysfs kobjects and attributes with sysfs kobject and attributes with out this patch, and they remain same. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	00c921c23f	Btrfs: sysfs: separate device kobject and its attribute creation Separate device kobject and its attribute creation so that device kobject can be created from the device discovery thread. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	0dd2906f72	Btrfs: sysfs: let default_attrs be separate from the kset As of now btrfs_attrs are provided using the default_attrs through the kset. Separate them and create the default_attrs using the sysfs_create_files instead. By doing this we will have the flexibility that device discovery thread could create fsid kobject. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	720592157e	Btrfs: sysfs: introduce function btrfs_sysfs_add_fsid() to create sysfs fsid We need it in a seperate function so that it can be called from the device discovery thread as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	3a08f3b72a	Btrfs: sysfs: rename __btrfs_sysfs_remove_one to btrfs_sysfs_remove_fsid Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:20 +02:00
Anand Jain	aaf1330516	Btrfs: sysfs: reorder the kobject creations As of now the order in which the kobjects are created at btrfs_sysfs_add_one() is.. fsid features unknown features (dynamic features) devices. Since we would move fsid and device kobject to fs_devices from fs_info structure, this patch will reorder in which the kobjects are created as below. fsid devices features unknown features (dynamic features) And hence the btrfs_sysfs_remove_one() will follow the same in reverse order. and the device kobject destroy now can be moved into the function __btrfs_sysfs_remove_one() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:19 +02:00
Anand Jain	4d435731f9	Btrfc: sysfs: fix, check if device_dir_kobj is init before destroy Since the failure code in the btrfs_sysfs_add_one() can call btrfs_sysfs_remove_one() even before device_dir_kobj has been created we need to check if its null. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:19 +02:00
Anand Jain	8345ea31dc	Btrfs: sysfs: fix, kobject pointer clean up needed after kobject release The sysfs clean up self test like in the below code fails, since fs_info->device_dir_kobject still points to its stale kobject. Reseting this pointer will help to fix this. open_ctree() { ret = btrfs_sysfs_add_one(fs_info); :: + btrfs_sysfs_remove_one(fs_info); + ret = btrfs_sysfs_add_one(fs_info); + if (ret) { + pr_err("BTRFS: failed to init sysfs interface: %d\n", ret); + goto fail_block_groups; + } Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:19 +02:00
Anand Jain	e7e1aa9c91	Btrfs: sysfs: fix, undo sysfs device links Theoritically need to remove the device links attributes, but since its entire device kobject was removed, so there wasn't any issue of about it. Just do it nicely. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:19 +02:00
Anand Jain	4e51f005a2	Btrfs: sysfs: fix, fs_info kobject_unregister has init_completion() twice kobject_unregister is to handle the release of the kobject, its completion init is being called in btrfs_sysfs_add_one(), so we don't have to do the same in the open_ctree() again. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:19 +02:00
Anand Jain	248d200df3	Btrfs: sysfs: fix, btrfs_release_super_kobj() should to clean up the kobject data The following test case fails indicating that, thread tried to init an initialized object. kernel: [232104.016513] kobject (ffff880006c1c980): tried to init an initialized object, something is seriously wrong. btrfs_sysfs_remove_one() self test code: open_tree() { :: ret = btrfs_sysfs_add_one(fs_info); if (ret) { pr_err("BTRFS: failed to init sysfs interface: %d\n", ret); goto fail_block_groups; } + btrfs_sysfs_remove_one(fs_info); + ret = btrfs_sysfs_add_one(fs_info); + if (ret) { + pr_err("BTRFS: failed to init sysfs interface: %d\n", ret); + goto fail_block_groups; + } cleaning up the unregistered kobject fixes this. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2015-05-27 12:27:18 +02:00
Julia Lawall	f6454b049d	block: fix returnvar.cocci warnings Remove unneeded variable used to store return value. Generated by: scripts/coccinelle/misc/returnvar.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-26 14:02:45 -06:00
Hannes Frederic Sowa	2b514574f7	net: af_unix: implement splice for stream af_unix sockets unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is inneglible, we mostly have to deal with a non-existing struct msghdr argument. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:59 -04:00

... 5 6 7 8 9 ...

41275 Commits