Commit Graph

34418 Commits

Author SHA1 Message Date
Wenliang Fan eb8052e015 fs/btrfs: Integer overflow in btrfs_ioctl_resize()
The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
	new_size = old_size + new_size;

Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:11 -08:00
Josef Bacik c9ea7b24ce Btrfs: stop caching thread if extent_commit_sem is contended
We can starve out the transaction commit with a bunch of caching threads all
running at the same time.  This is because we will only drop the
extent_commit_sem if we need_resched(), which isn't likely to happen since we
will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
observed that he could starve out a transaction commit for up to a minute with
32 caching threads all running at once.  This will allow us to drop the
extent_commit_sem to allow the transaction commit to swap the commit_root out
and then all the cachers will start back up. Here is an explanation provided by
Igno

So, just to fill in what happens in this loop:

                                mutex_unlock(&caching_ctl->mutex);
                                cond_resched();
                                goto again;

where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
again:

        again:
                mutex_lock(&caching_ctl->mutex);
                /* need to make sure the commit_root doesn't disappear */
                down_read(&fs_info->extent_commit_sem);

So, if I'm reading the code correct, there can be a fair amount of
concurrency here: there may be multiple 'caching kthreads' per filesystem
active, while there's one fs_info->extent_commit_sem per filesystem
AFAICS.

So, what happens if there are a lot of CPUs all busy holding the
->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
rush to try to release the fs_info->extent_commit_sem, and they'd block in
the down_read() because there's a writer waiting.

So there's a guarantee of forward progress. This should answer akpm's
concern I think.

Thanks,

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:10 -08:00
Miao Xie 67de11769b Btrfs: introduce the delayed inode ref deletion for the single link inode
The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).

This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:09 -08:00
Miao Xie 7cf35d91b4 Btrfs: use flags instead of the bool variants in delayed node
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:08 -08:00
Miao Xie a56dbd8940 Btrfs: remove btrfs_end_transaction_dmeta()
Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
  so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
  workers should not commit the transaction, instead, deal with the delayed
  items as many as possible.

So we can remove btrfs_end_transaction_dmeta()

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:08 -08:00
Miao Xie 0353808cae Btrfs: cleanup code of btrfs_balance_delayed_items()
- move the condition check for wait into a function
- use wait_event_interruptible instead of prepare-schedule-finish process

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:07 -08:00
Miao Xie 4dd466d36a Btrfs: don't run delayed nodes again after all nodes flush
If the number of the delayed items is greater than the upper limit, we will
try to flush all the delayed items. After that, it is unnecessary to run
them again because they are being dealt with by the wokers or the number of
them is less than the lower limit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:06 -08:00
Miao Xie 74c40f925e Btrfs: remove residual code in delayed inode async helper
Before applying the patch
  commit de3cb945db
  title: Btrfs: improve the delayed inode throttling

We need requeue the async work after the current work was done, it
introduced a deadlock problem. So we wrote the code that this patch
removes to avoid the above problem. But after applying the above
patch, the deadlock problem didn't exist. So we should remove that
fix code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:06 -08:00
Frank Holton efe120a067 Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:05 -08:00
Filipe David Borba Manana 5de865eebb Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:

  btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
      --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
      +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
      @@ -1,3 +1,8 @@
       QA output created by 004
       *** test backref walking
      -*** done
      +unexpected output from
      +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
      +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
      +
       ...
       (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
  Ran: btrfs/004
  Failures: btrfs/004
  Failed 1 of 1 tests

But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:

  $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
  inode 405 offset 454656 root 258
  inode 405 offset 454656 root 5

It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:

[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0

These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:

  c8cc634165
  (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)

That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.

This issue has been experienced by several users recently, such as for example:

  http://www.spinics.net/lists/linux-btrfs/msg28574.html

After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:04 -08:00
David Sterba 66ef7d65c3 btrfs: check balance of send_in_progress
Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:04 -08:00
Wang Shilong 41ce9970a8 Btrfs: remove transaction from btrfs send
Since daivd did the work that force us to use readonly snapshot,
we can safely remove transaction protection from btrfs send.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:03 -08:00
Miao Xie 536cd96401 Btrfs: fix double initialization of the raid kobject
We met the following oops when doing space balance:
 kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
 ...
 Call Trace:
  [<ffffffff81937262>] dump_stack+0x49/0x5f
  [<ffffffff8137d259>] kobject_init+0x89/0xa0
  [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
  [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
  [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
  [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
  [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
  [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
  [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
  [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
  [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
  [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
  ...

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o nospace_cache <dev> <mnt>
 # btrfs balance start <mnt>
 # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1

The reason of this problem is that we initialized the raid kobject when we added
a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
the raid list was empty, we would initialize the raid kobject when we added the first
block group. But if there was not data stored in the block group, the block group
would be freed when doing balance, and the raid list would be empty. And then if we
allocated a new block group and added it into the raid list, we would initialize
the raid kobject again, the oops happened.

Fix this problem by initializing the raid kobject just when mounting the fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:03 -08:00
Wang Shilong 180589efde Btrfs: fix a warning when iput a file
See the warning below:

[ 1209.102076]  [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs]
[ 1209.102084]  [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs]
[ 1209.102089]  [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40
[ 1209.102092]  [<ffffffff8118ab2e>] evict+0x9e/0x190
[ 1209.102094]  [<ffffffff8118b313>] iput+0xf3/0x180
[ 1209.102101]  [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs]
[ 1209.102107]  [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs]

clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:02 -08:00
David Sterba 2c68653787 btrfs: Check read-only status of roots during send
All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.

Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.

We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.

Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:01 -08:00
David Sterba a8d89f5ba0 btrfs: remove unused mnt from send_ctx
Unused since ed2590953b
"Btrfs: stop using vfs_read in send".

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:01 -08:00
David Sterba 95bc79d50d btrfs: send: clean up dead code
Remove ifdefed code:

- tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
- tlv_put_timespec - the btrfs_timespec fields are used
- fs_path_remove obsoleted long ago

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:00 -08:00
Filipe David Borba Manana 3fe81ce206 Btrfs: fix deadlock when iterating inode refs and running delayed inodes
While running btrfs/004 from xfstests, after 503 iterations, dmesg reported
a deadlock between tasks iterating inode refs and tasks running delayed inodes
(during a transaction commit).

It turns out that iterating inode refs implies doing one tree search and
release all nodes in the path except the leaf node, and then passing that
leaf node to btrfs_ref_to_path(), which in turn does another tree search
without releasing the lock on the leaf node it received as parameter.

This is a problem when other task wants to write to the btree as well and
ends up updating the leaf that is read locked - the writer task locks the
parent of the leaf and then blocks waiting for the leaf's lock to be
released - at the same time, the task executing btrfs_ref_to_path()
does a second tree search, without releasing the lock on the first leaf,
and wants to access a leaf (the same or another one) that is a child of
the same parent, resulting in a deadlock.

The trace reported by lockdep follows.

[84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds.
[84314.936381]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936386] fsstress        D ffff8806e1bf8000     0 11930  11926 0x00000000
[84314.936393]  ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd
[84314.936399]  ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8
[84314.936405]  ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190
[84314.936410] Call Trace:
[84314.936421]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936428]  [<ffffffff81774269>] schedule+0x29/0x70
[84314.936451]  [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs]
[84314.936457]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936470]  [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs]
[84314.936489]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936504]  [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[84314.936510]  [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0
[84314.936528]  [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs]
[84314.936543]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936558]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936573]  [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs]
[84314.936589]  [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs]
[84314.936604]  [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs]
[84314.936620]  [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs]
[84314.936630]  [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs]
[84314.936635]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936639]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936643]  [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30
[84314.936648]  [<ffffffff811a3541>] iterate_supers+0xf1/0x100
[84314.936652]  [<ffffffff811d0c45>] sys_sync+0x55/0x90
[84314.936658]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936660] INFO: lockdep is turned off.
[84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds.
[84314.936666]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936670] btrfs           D ffff880541729a88     0 11955  11608 0x00000000
[84314.936674]  ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd
[84314.936680]  ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8
[84314.936685]  ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000
[84314.936690] Call Trace:
[84314.936695]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936700]  [<ffffffff81774269>] schedule+0x29/0x70
[84314.936717]  [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs]
[84314.936721]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936733]  [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs]
[84314.936746]  [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs]
[84314.936763]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936780]  [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs]
[84314.936797]  [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs]
[84314.936813]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936830]  [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs]
[84314.936846]  [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs]
[84314.936851]  [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200
[84314.936868]  [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs]
[84314.936873]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[84314.936877]  [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00
[84314.936882]  [<ffffffff81075563>] ? up_read+0x23/0x40
[84314.936886]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[84314.936892]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[84314.936896]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[84314.936901]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[84314.936906]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[84314.936910]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[84314.936915]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[84314.936920]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936922] INFO: lockdep is turned off.
[84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds.
[84434.866881]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84434.866886] btrfs-transacti D ffff880755b6a478     0 11921      2 0x00000000
[84434.866893]  ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd
[84434.866899]  ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8
[84434.866904]  ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0
[84434.866910] Call Trace:
[84434.866920]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84434.866927]  [<ffffffff81774269>] schedule+0x29/0x70
[84434.866948]  [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
[84434.866954]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84434.866970]  [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs]
[84434.866985]  [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs]
[84434.866999]  [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
[84434.867012]  [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs]
[84434.867026]  [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs]
[84434.867030]  [<ffffffff81070dad>] kthread+0xed/0x100
[84434.867035]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
[84434.867040]  [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0
[84434.867044]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:59 -08:00
Wang Shilong 663df05330 Btrfs: remove dead comments for read_csums()
Chris introduced hleper function  read_csums() and this function
has been removed, but we forgot to remove its corresponding comments.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:59 -08:00
Filipe David Borba Manana e223cfcd3e Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct
It's not used anywhere, so just drop it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:58 -08:00
Filipe David Borba Manana fc28b62d64 Btrfs: fix use of uninitialized err variable
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:57 -08:00
Wang Shilong 54eb72c05f Btrfs: remove unnecessary filemap writting and waiting after block group relocation
We have commited transaction before, remove redundant filemap writting and
waiting here, it can speed up balance relocation process.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:57 -08:00
Tsutomu Itoh 5662344b3c Btrfs: fix error check of btrfs_lookup_dentry()
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.

Callers who use btrfs_lookup_dentry() require a trivial update.

create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Filipe David Borba Manana 7835776635 Btrfs: return immediately if tree log mod is not necessary
In ctree.c:tree_mod_log_set_node_key() we were calling
__tree_mod_log_insert_key() even when the modification doesn't need
to be logged. This would allocate a tree_mod_elem structure, fill it
and pass it to  __tree_mod_log_insert(), which would just acquire
the tree mod log write lock and then free the tree_mod_elem structure
and return (that is, a no-op).

Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
which just returns immediately if the modification doesn't need to be
logged (without allocating the structure, fill it, acquire write lock,
free structure).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Josef Bacik f28491e0a6 Btrfs: move the extent buffer radix tree into the fs_info
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode.  The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:55 -08:00
Josef Bacik 34b41acec1 Btrfs: use a bit to track if we're in the radix tree
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers.  With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal.  This will give me a way to do that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:54 -08:00
Josef Bacik a5dee37d39 Btrfs: deal with io_tree->mapping being NULL
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:54 -08:00
Filipe David Borba Manana 2ef1fed285 Btrfs: more efficient push_leaf_right
Currently when finding the leaf to insert a key into a btree, if the
leaf doesn't have enough space to store the item we attempt to move
off some items from our leaf to its right neighbor leaf, and if this
fails to create enough free space in our leaf, we try to move off more
items to the left neighbor leaf as well.

When trying to move off items to the right neighbor leaf, if it has
enough room to store the new key but not not enough room to move off
at least one item from our target leaf, __push_leaf_right returns 1 and
we have to attempt to move items to the left neighbor (push_leaf_left
function) without touching the right neighbor leaf.
For the case where the right leaf has enough room to store at least 1
item from our leaf, we end up modifying (and dirtying) both our leaf
and the right leaf. This is non-optimal for the case where the new key
is greater than any key in our target leaf because it can be inserted at
slot 0 of the right neighbor leaf and we don't need to touch our leaf
at all nor to attempt to move off items to the left neighbor leaf.

Therefore this change just selects the right neighbor leaf as our new
target leaf if it has enough room for the new key without modifying our
initial target leaf - we do this only if the new key is higher than any
key in the initial target leaf.

While running the following test, push_leaf_right was called by split_leaf
4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
special case (right leaf has enough room and new key is higher than any key
in the initial target leaf).

Test:

  sysbench --test=fileio --file-num=512 --file-total-size=5G \
    --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Results:

sequential writes

Throughput before this change: 65.71Mb/sec (average of 10 runs)
Throughput after this change:  66.58Mb/sec (average of 10 runs)

random writes

Throughput before this change: 10.75Mb/sec (average of 10 runs)
Throughput after this change:  11.56Mb/sec (average of 10 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:53 -08:00
Wang Shilong cb7ab02156 Btrfs: wrap repeated code into scrub_blocked_if_needed()
Just wrap same code into one function scrub_blocked_if_needed().

This make a change that we will move waiting (@workers_pending = 0)
before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
we must take carefully to not deadlock here.

Thread 1			Thread 2
				|->btrfs_commit_transaction()
					|->set trans type(COMMIT_DOING)
					|->btrfs_scrub_paused()(blocked)
|->join_transaction(blocked)

Move btrfs_scrub_paused() before setting trans type which means we can
still join a transaction when commiting_transaction is blocked.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:53 -08:00
Wang Shilong 3cb0929ad2 Btrfs: fix wrong super generation mismatch when scrubbing supers
We came a race condition when scrubbing superblocks, the story is:

In commiting transaction, we will update @last_trans_commited after
writting superblocks, if scrubber start after writting superblocks
and before updating @last_trans_commited, generation mismatch happens!

We fix this by checking @scrub_pause_req, and we won't start a srubber
until commiting transaction is finished.(after btrfs_scrub_continue()
finished.)

Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:52 -08:00
Filipe David Borba Manana 5a0f4e2c2b Btrfs: fix pass of transid with wrong endianness in send.c
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2190:9:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2190:9:    got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2195:17:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2195:17:    got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:3716:9:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:3716:9:    got restricted __le64 [usertype] ctransid

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:51 -08:00
Filipe David Borba Manana d527afe1e5 Btrfs: fix extent_map block_len after merging
When merging an extent_map with its right neighbor, increment
its block_len with the neighbor's block_len.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:51 -08:00
Michal Nazarewicz 11850392ee btrfs: remove dead code
[commit 8185554d: fix incorrect inode acl reset] introduced a dead
code by adding a condition which can never be true to an else
branch.  The condition can never be true because it is already
checked by a previous if statement which causes function to return.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:50 -08:00
Filipe David Borba Manana 878f2d2cb3 Btrfs: fix max dir item size calculation
We were accounting for sizeof(struct btrfs_item) twice, once
in the data_size variable and another time in the if statement
below.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:49 -08:00
Filipe David Borba Manana 12cfbad90e Btrfs: more efficient extent state insertions
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.

This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:

Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles:  90th: 20985.000; 95th: 21155.000; 99th: 21369.000
  51.000 -   93.933:   693 ########
  93.933 -  172.314:   938 ##########
 172.314 -  315.408:   856 #########
 315.408 -  576.646:    95 #
 576.646 - 6415.830:   888 ##########
6415.830 - 11713.809:  1024 ###########
11713.809 - 21386.000:  4816 #####################################################

So traversing such trees can take some significant time that can
easily be avoided.

Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:

  sysbench --test=fileio --file-num=1 --file-total-size=2G \
    --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
    --max-requests=0 --max-time=60 --file-io-mode=sync

  sysbench --test=fileio --file-num=1 --file-total-size=2G \
    --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
    --max-requests=0 --max-time=60 --file-io-mode=sync

Before this change:

sequential writes: 69.28Mb/sec (average of 5 runs)
random writes:     4.14Mb/sec  (average of 5 runs)

After this change:

sequential writes: 69.91Mb/sec (average of 5 runs)
random writes:     5.69Mb/sec  (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:49 -08:00
Filipe David Borba Manana c42ac0bc95 Btrfs: add missing extent state caching calls
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:48 -08:00
Filipe David Borba Manana 32193c147f Btrfs: faster and more efficient extent map insertion
Before this change, adding an extent map to the extent map tree of an
inode required 2 tree nevigations:

1) doing a tree navigation to search for an existing extent map starting
   at the same offset or an extent map that overlaps the extent map we
   want to insert;

2) Another tree navigation to add the extent map to the tree (if the
   former tree search didn't found anything).

This change just merges these 2 steps into a single one.
While running first few btrfs xfstests I had noticed these trees easily
had a few hundred elements, and then with the following sysbench test it
reached over 1100 elements very often.

Test:

  sysbench --test=fileio --file-num=32 --file-total-size=10G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=1000000 --file-io-mode=sync [prepare|run]

(fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
prepare phase)

Before this patch:

run 1 - 41.894Mb/sec
run 2 - 40.527Mb/sec
run 3 - 40.922Mb/sec
run 4 - 49.433Mb/sec
run 5 - 40.959Mb/sec

average - 42.75Mb/sec

After this patch:

run 1 - 48.036Mb/sec
run 2 - 50.21Mb/sec
run 3 - 50.929Mb/sec
run 4 - 46.881Mb/sec
run 5 - 53.192Mb/sec

average - 49.85Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:48 -08:00
Filipe David Borba Manana 68ba990f7d Btrfs: fix extent boundary check in bio_readpage_error
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:47 -08:00
Filipe David Borba Manana 5a4267ca20 Btrfs: try harder to avoid btree node splits
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.

After 5 runs of the following test, I got a smaller number of btree
node splits overall:

sysbench --test=fileio --file-num=512 --file-total-size=5G \
  --file-test-mode=seqwr --num-threads=512 \
   --file-block-size=8192 --max-requests=100000 --file-io-mode=sync

Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)

After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)

An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:46 -08:00
Filipe David Borba Manana 1b8e7e45e5 Btrfs: avoid unnecessary ordered extent cache resets
After an ordered extent completes, don't blindly reset the
inode's ordered tree last accessed ordered extent pointer.

While running the xfstests I noticed that about 29% of the
time the ordered extent to which tree->last pointed was not
the same as our just completed ordered extent. After that I
ran the following sysbench test (after a prepare phase) and
noticed that about 68% of the time tree->last pointed to
a different ordered extent too.

sysbench --test=fileio --file-num=32 --file-total-size=4G \
    --file-test-mode=rndwr --num-threads=512 \
    --file-block-size=32768 --max-time=60 --max-requests=0 run

Therefore reset tree->last on ordered extent removal only if
it pointed to the ordered extent we're removing from the tree.

Results from 4 runs of the following test before and after
applying this patch:

$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
  --file-test-mode=seqwr --num-threads=512 \
  --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
  --file-test-mode=seqwr --num-threads=512 \
  --file-block-size=32768 --max-time=60 --file-io-mode=sync run

Before this path:

run 1 - 64.049Mb/sec
run 2 - 63.455Mb/sec
run 3 - 64.656Mb/sec
run 4 - 63.833Mb/sec

After this patch:

run 1 - 66.149Mb/sec
run 2 - 68.459Mb/sec
run 3 - 66.338Mb/sec
run 4 - 66.176Mb/sec

With random writes (--file-test-mode=rndwr) I had huge fluctuations
on the results (+- 35% easily).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:46 -08:00
Jeff Mahoney e453d989e0 btrfs: fix leaks during sysfs teardown
Filipe noticed that we were leaking the features attribute group
after umount. His fix of just calling sysfs_remove_group() wasn't enough
since that removes just the supported features and not the unsupported
features.

This patch changes the unknown feature handling to add them individually
so we can skip the kmalloc and uses the same iteration to tear them down
later.

We also fix the error handling during mount so that we catch the
failing creation of the per-super kobject, and handle proper teardown
of a half-setup sysfs context.

Tested properly with kmemleak enabled this time.

Reported-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:45 -08:00
Jeff Mahoney 1b8e5df6d9 btrfs: fix static checker warnings
This patch fixes the following warnings:
fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:44 -08:00
Filipe David Borba Manana 131e404a2a Btrfs: fix very slow inode eviction and fs unmount
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).

A quick way to reproduce this issue:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m25.457s
user	0m0.000s
sys	0m0.092s
$ cd ..
$ time umount /mnt/btrfs

real	1m38.234s
user	0m0.000s
sys	1m25.760s

The same test on ext4 runs much faster:

$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4

real	0m3.626s
user	0m0.004s
sys	0m3.012s

After this patch, the unmount (inode evictions) is much faster:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m26.774s
user	0m0.000s
sys	0m0.084s
$ cd ..
$ time umount /mnt/btrfs

real	0m1.811s
user	0m0.000s
sys	0m1.564s

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:44 -08:00
Wang Shilong 0647bf564f Btrfs: improve forever loop when doing balance relocation
We hit a forever loop when doing balance relocation,the reason
is that we firstly reserve 4M(node size is 16k).and within transaction
we will try to add extra reservation for snapshot roots,this will
return -EAGAIN if there has been a thread flushing space to reserve
space.We will do this again and again with filesystem becoming nearly
full.

If the above '-EAGAIN' case happens, we try to refill reservation more
outsize of transaction, and this will return eariler in enospc case,however,
this dosen't really hurt because it makes no sense doing balance relocation
with the filesystem nearly full.

Miao Xie helped a lot to track this issue, thanks.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:43 -08:00
Filipe David Borba Manana 6126e3caf7 Btrfs: fix ordered extent check in btrfs_punch_hole
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie 376cc685cb Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
 fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
 fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
 fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync prepare
 # cd -
 # umount <mnt>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync run
 # cd -
 # umount <mnt>

The reason of this problem is:
Task0				Task1
btrfs_direct_IO
  unlock(&inode->i_mutex)
				lock(&inode->i_mutex)
				reserve_space()
				prepare_pages()
				  lock_extent()
				  clear_extent()
				  unlock_extent()
  lock_extent()
  test_extent(uptodate)
    return false
				copy_data()
				set_delalloc_extent()
  extent need compress
    go back to buffered write
  clear_extent(DELALLOC | DIRTY)
  unlock_extent()

Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.

This patch fixes the above bug by unlocking the extent after the delalloc.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie b37392ea86 Btrfs: cleanup unnecessary parameter and variant of prepare_pages()
- the caller has gotten the inode object, needn't pass the file object.
  And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
  we also needn't pass the root object into prepare_pages().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba cc37bb0420 btrfs: replace BUG in can_modify_feature
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba 43d87fa231 btrfs: reserve no transaction units in btrfs_feature_attr_store
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:40 -08:00
Frank Holton 27a0dd61a5 Btrfs: make btrfs_debug match pr_debug handling related to DEBUG
The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Sergei Trofimovich 33b98f2271 btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Kelley Nielsen 3f870c2899 btrfs: expand btrfs_find_item() to include find_orphan_item functionality
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.

Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:37 -08:00
Kelley Nielsen 75ac2dd907 btrfs: expand btrfs_find_item() to include find_root_ref functionality
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.

Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Kelley Nielsen e33d5c3d6d btrfs: bootstrap generic btrfs_find_item interface
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.

This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Valentina Giusti a3df41ee37 btrfs: fix unused variables in qgroup.c
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti e94acd86d4 btrfs: replace path->slots[0] with otherwise unused variable 'slot'
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti ce3e7f1073 btrfs: remove unused variable from scrub_fixup_nodatasum
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:34 -08:00
Valentina Giusti f0265bb409 btrfs: remove unused variable from setup_cluster_no_bitmap
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti 50892bac3b btrfs: remove unused variables from extent_io.c
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti 4b447bfac6 btrfs: remove unused variable from find_free_extent
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:32 -08:00
Valentina Giusti 71db2a7751 btrfs: remove unused variables from disk-io.c
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Valentina Giusti 99e22f783b btrfs: remove unused variable from btrfs_new_inode
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Jeff Mahoney f8ba9c11f8 btrfs: publish fs label in sysfs
This adds a writeable attribute which describes the label.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:30 -08:00
Jeff Mahoney 29e5be240a btrfs: publish device membership in sysfs
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney 6ab0a2029c btrfs: publish allocation data in sysfs
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney 01e219e806 btrfs: add ioctl to export size of global metadata reservation
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.

This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney 3b02a68a63 btrfs: use feature attribute names to print better error messages
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney ba631941ef btrfs: add ability to change features via sysfs
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.

Since the the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.

If neither set or clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:27 -08:00
Jeff Mahoney 79da4fa4d9 btrfs: publish unknown feature bits in sysfs
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.

This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney 510d73600a btrfs: publish per-super features in sysfs
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney 5ac1d209f1 btrfs: publish per-super attributes in sysfs
This patch adds per-super attributes to sysfs.

It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney 079b72bca3 btrfs: publish supported featured in sysfs
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.

The files are module-wide and export which features the kernel supports.

The content, for now, is just "0\n".

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney 2eaa055fab btrfs: add ioctls to query/change feature bits online
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.

We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
  basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.

We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.

The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo 9e5ac13acb Btrfs: skip merge part for delayed data refs
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data refs related to
this disk offset but we may have millions refs.

And in the case of delayed data refs, we don't usually have too much
data refs to merge.

So it's safe to shut it down for data refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo c46effa601 Btrfs: introduce a head ref rbtree
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.

The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.

When we have a great number of delayed refs pending to process,
this'll cost time a lot.

Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:22 -08:00
Josef Bacik e20d6c5ba3 Btrfs: fix check-integrity to look at the referenced data properly
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression.  Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced.  This fixes check_int_data freaking out when running
btrfs/004.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Josef Bacik 16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made somethings very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Linus Torvalds 48ba620aab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This is a set of 3 regression fixes.

  This fixes /proc/mounts when using "ip netns add <netns>" to display
  the actual mount point.

  This fixes a regression in clone that broke lxc-attach.

  This fixes a regression in the permission checks for mounting /proc
  that made proc unmountable if binfmt_misc was in use.  Oops.

  My apologies for sending this pull request so late.  Al Viro gave
  interesting review comments about the d_path fix that I wanted to
  address in detail before I sent this pull request.  Unfortunately a
  bad round of colds kept from addressing that in detail until today.
  The executive summary of the review was:

  Al: Is patching d_path really sufficient?
      The prepend_path, d_path, d_absolute_path, and __d_path family of
      functions is a really mess.

  Me: Yes, patching d_path is really sufficient.  Yes, the code is mess.
      No it is not appropriate to rewrite all of d_path for a regression
      that has existed for entirely too long already, when a two line
      change will do"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  vfs: Fix a regression in mounting proc
  fork:  Allow CLONE_PARENT after setns(CLONE_NEWPID)
  vfs: In d_path don't call d_dname on a mount point
2014-01-17 17:29:36 -08:00
Linus Torvalds 70b23ce347 fix data corruption on NFS writeback
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJS080OAAoJECvKgwp+S8JaIdUQAJKNZTzXKylUjUZty42t57Jh
 1qRrQeJ6ha+JVSpYX4jJz/mSzUdJdjoFg7J3O54OnVFj/CnlcY7GRZj3VMel9ijf
 uhlf8DcU6JsThcFK4Q6mqXtdAHDPkQ1jkQHLNCe7bow9AjCzHymAZWJix4YvEsXF
 zeJJURMqSaJeo/44MynnXyn/h5RRhg+5HWErhoFiVUzDzHR3RoQqmt3lPVVJkdj1
 iokHLMzGui2vs52vUJj2yx7m9kaoDx/6bJpqR61qHfk5S4wjLkUI+1ID8dsTNVF2
 4O3THb0nUDWx4wuJIxrAKoPiYjiemX1KmQXlUVr3IsfhDiiBbLyviVyn4aRaFIxV
 IRCVXCj1CWw+cFLeCA5E+/WvpxjLfKs4WNBxIqjes5YRPM4PLpU3MDiabssaUzHI
 0VPbU8TQ05hqH0wbs0hIgXyvED6yNn9d3sPHS2Lb5i2tp3E0FzVEoh2EH2jn8lmQ
 1DAdi+ezk9EiJs8AFiN6MSIBpAZosX3Nq+RTmYGKqLZMGnxlJ30YspNlipiBPFpC
 4xokkMZAZ0+wzpVabOMie36Rc/AaOAqiOjS1C6UIoOSrBTgtwWL7Ft2Da3SKb0KX
 XQhNWCHNYcgOn9/DDDmGxzwt6HsEzOIYinMwrG37LSass5KEvopssmiLCXn8wry+
 QXUoiFFFAPpg8iXaqj4X
 =AHdo
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback fix from Wu Fengguang:
 "Fix data corruption on NFS writeback.

  It has been in linux-next for one month"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Fix data corruption on NFS
2014-01-16 08:23:34 +07:00
Andreas Rohner 70f2fe3a26 nilfs2: fix segctor bug that causes file system corruption
There is a bug in the function nilfs_segctor_collect, which results in
active data being written to a segment, that is marked as clean.  It is
possible, that this segment is selected for a later segment
construction, whereby the old data is overwritten.

The problem shows itself with the following kernel log message:

  nilfs_sufile_do_cancel_free: segment 6533 must be clean

Usually a few hours later the file system gets corrupted:

  NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
  NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)

The issue can be reproduced with a file system that is nearly full and
with the cleaner running, while some IO intensive task is running.
Although it is quite hard to reproduce.

This is what happens:

 1. The cleaner starts the segment construction
 2. nilfs_segctor_collect is called
 3. sc_stage is on NILFS_ST_SUFILE and segments are freed
 4. sc_stage is on NILFS_ST_DAT current segment is full
 5. nilfs_segctor_extend_segments is called, which
    allocates a new segment
 6. The new segment is one of the segments freed in step 3
 7. nilfs_sufile_cancel_freev is called and produces an error message
 8. Loop around and the collection starts again
 9. sc_stage is on NILFS_ST_SUFILE and segments are freed
    including the newly allocated segment, which will contain active
    data and can be allocated at a later time
10. A few hours later another segment construction allocates the
    segment and causes file system corruption

This can be prevented by simply reordering the statements.  If
nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
the freed segments are marked as dirty and cannot be allocated any more.

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-15 14:19:42 +07:00
Linus Torvalds e2bc44706f xfs: bugfixes for 3.13-rc8
- fix off-by-one in xfs_attr3_rmt_verify
 - fix missing destroy_work_on_stack() in xfs_bmapi_allocate
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJS0ECWAAoJENaLyazVq6ZOgn0QAKSC/pkP4km+QbmL0R7SqSJH
 ZSSj16gIjR5lHlwI3PQzv5BgyEC9BcRDKWXN6dy+GHHuMtP4qYK8cLWFcyl7EysH
 HAyDBnaJVphXt23C5iIzk+iseNfRYXA2LOpYSH6qfhZ5bxEeYzQS42zL4YhxZrXq
 kzLHojcTLUx0IzJ+4oHn5AXSgPt+PXxNz3s+TU9virFnfSMlw2qYukxQtG49nbQr
 kQjNHgeTIBKzeHdlnxmv5Rd2bD//397w5aWXxmaUh8fk6Z7VJi40ALAG4Pks81HF
 +TEgMtF9/xTXdlwrYJDoHp++vUs6HANCX+wSAb4MdrBQvjh/USytK2WFwOeMyyR6
 L/iogfPXHHizTkoYSzPwPdEmCCFhzidvBEqNX68+ojlJnDtoart7IgkOcm9LvaQI
 j//u76CPRcd8tFh+1fDNaXn1ykJ6/CepSY13/yOnbpc7JoDbtqK2R8HFxdSlkDDg
 UooLF2AfQ6lX280cUWwV0flqGO6iTIM3Fw1mIq3z8X4usNn+bMnlOu/DUnCbF5bB
 YJCV4uT7f04w7oJqin9a7LHaHKRD56tWQun/OCEd7ZV/hJ1YRYlhhLfSdWdX7+SX
 oIawXJy7NvCPQLaTwycD3h2gDlaxw17GAc9rA3AcCknxBsgNosv1ETQnEPC4iIAq
 QsVal7p6oMLZ/qx6mvX7
 =Xpq3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 "Here we have a bugfix for an off-by-one in the remote attribute
  verifier that results in a forced shutdown which you can hit with v5
  superblock by creating a 64k xattr, and a fix for a missing
  destroy_work_on_stack() in the allocation worker.

  It's a bit late, but they are both fairly straightforward"

* tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs:
  xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
  xfs: fix off-by-one error in xfs_attr3_rmt_verify
2014-01-11 06:33:03 +07:00
Chuansheng Liu 1f4a63bf01 xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().

Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 6f96b3063c)
2014-01-10 12:39:38 -06:00
Jie Liu bba719b500 xfs: fix off-by-one error in xfs_attr3_rmt_verify
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:

<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f

Tests:
    setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile

This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 85dd0707f0)
2014-01-10 12:38:41 -06:00
Linus Torvalds ef350bb7c5 Fix a regression introduced in v3.13-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABCAAGBQJSyv4GAAoJENNvdpvBGATwwG4QANhHQupchnt4vnyetcvTZvs3
 x0BlnZnGDwzBqRhiZa8tORABn8z/8JuzsepqZOjKfmuULDHO4hjv42DSYhiBf7l7
 rIjrPhDuSoP6aVIiaLxllaVe+d18/fLUeoJ4bw2/Np9lLTjALA7j7zpVfy9RsIrr
 mreIh5Nu7ay8R/5Mts7ApJwQTtHHEOWm+NcsisZIFoCyuJKGDyeWOutlpGcgSn2T
 W3pUTF/iuN3trXAr+VYWfn/yqewWYlQ9hEifFYiqef7dEo9ITgO7Gn0Ig12PhUcX
 KuaRcAXsx+ynB3gLjsocCPHfHRouqeEN1jzfbVLn9GIHlgU9JEYAhyZB7eKvjoAH
 kf7IKWEOOVVQcRpJLmr3cXiY3ut5gqyytwftc4lntJG5nLbDAw3MihScSRdm1DBR
 ELHD60IDHvFtGGeCqgu11MCXoXBM2HG7iQWnCQfsktlAWPSmnAtqZO6vYzPtJHvD
 iMv2WuIBlEB0Qx/JJwi27vCb8PXmfj3mT/mvr8UhlMSl1W5cBG3nXOmbiZC33tmE
 nlB0j1UC21VFOUoA9BxYP5imfpUvD5fvGCmr6QE/mw9Y675q4QIWGo8P9EJiaCBL
 zx1VbgofNLs/DoCPzMlqR9qrO4w2SwWOuwY1dJq7EkP3cfQd2mv/cq3UrqDEtxju
 otnvt2Z2dadntRq3vGv3
 =6Ghs
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfix from Ted Ts'o:
 "Fix a regression introduced in v3.13-rc6"

* tag 'ext4_for_linus_stable' of http://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix bigalloc regression
2014-01-07 08:22:42 +08:00
Eric Whitney d0abafac8c ext4: fix bigalloc regression
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize).  It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.

The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
2014-01-06 14:00:23 -05:00
Linus Torvalds 06f055f394 Merge branch 'akpm' (incoming from Andrew)
Merge patches from Andrew Morton:
 "Ten fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
  sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
  drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test()
  mm/memory-failure.c: transfer page count from head page to tail page after split thp
  MAINTAINERS: set up proper record for Xilinx Zynq
  mm: remove bogus warning in copy_huge_pmd()
  memcg: fix memcg_size() calculation
  mm: fix use-after-free in sys_remap_file_pages
  mm: munlock: fix deadlock in __munlock_pagevec()
  mm: munlock: fix a bug where THP tail page is encountered
2014-01-02 14:40:38 -08:00
Jason Baron 4ff36ee94d epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
commmit 67347fe4e6 ("epoll: do not take global 'epmutex' for simple
topologies").

The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')

However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.

The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon.  The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock.  Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Linus Torvalds 152b734a9e Here is a set of small fixes for GFS2. There is a fix to drop
s_umount which is copied in from the core vfs, two patches
 relate to a hard to hit "use after free" and memory leak.
 Two patches related to using DIO and buffered I/O on the same
 file to ensure correct operation in relation to glock state
 changes. The final patch adds an RCU read lock to ensure
 correct locking on an error path.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSxVxyAAoJEMrg3m4a/8jSP1kQAKkyW6DgevgZ+IHlm5+mhTeZ
 Bpdy3l6DdxZIiqoG0VqJo6DoeR4td+1q7TfyjpvFvgxjU/m/nLhKFcNd1A6TN3OK
 G9Y6q0k0aWsAUPUjg3Y6gFRAlXHQaGXQ3nMDmoTCdYSYqid8gB+oqPbfwf5uHAgU
 GVPgKxqSsJmzxPYTjpjx8mdpgiwCHa+iB+reoqxNSdxJnAk93GrBA7efonNoxKB1
 r8VJlgkJubMjxGMu6xQYLMyt1Xed85sbiASOdE+Thw700tBA/ZAtKuB8xZ4+X1Fd
 M5osKYnqodde+A3aSi6P7b+M6N+WyA/7bHhckbaQy8cwpC9xhgEqsEsIEFm0eJjB
 wbdGe2tsCTUvLy37++D5e88cF9O2F6Ku0MJJtb7KsTLZPFD9XXs/6/xx4vSSNKQt
 FC7BF5dkQiLDJvy1xvcHK43+PbOaS7/8WM1NuoNAS/L/3RYFrrHby3LqBo+kcUbV
 L9HoL8aJd60bsX7PceXA9UzaH8yk/yTgeyOtd2+VCiRVldvNtx32ylTJLUqqxeRi
 AL/tZWgxwPKb54AJMptPZ0fGP5A+pUhQgTm7fJCwrUdXQXWUW0YYK2sV3H9BZ8px
 Ga0PuJtjxj8OkGFwnugEtuQNGQ9M5uCX4UiELqP3rVRNpq4e8UkOZRqtHsU7urSB
 ezufwdI+b+uHUucva31D
 =KSDi
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

Pull GFS2 fixes from Steven Whitehouse:
 "Here is a set of small fixes for GFS2.  There is a fix to drop
  s_umount which is copied in from the core vfs, two patches relate to a
  hard to hit "use after free" and memory leak.  Two patches related to
  using DIO and buffered I/O on the same file to ensure correct
  operation in relation to glock state changes.  The final patch adds an
  RCU read lock to ensure correct locking on an error path"

* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Fix unsafe dereference in dump_holder()
  GFS2: Wait for async DIO in glock state changes
  GFS2: Fix incorrect invalidation for DIO/buffered I/O
  GFS2: Fix slab memory leak in gfs2_bufdata
  GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
  GFS2: don't hold s_umount over blkdev_put
2014-01-02 12:45:47 -08:00
Tetsuo Handa 0b3a2c9968 GFS2: Fix unsafe dereference in dump_holder()
GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that
RCU read lock is held when using task_struct returned from pid_task().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-02 12:18:04 +00:00
Shirish Pargaonkar f1e3268126 cifs: set FILE_CREATED
Set FILE_CREATED on O_CREAT|O_EXCL.

cifs code didn't change during commit 116cc02253

Kernel bugzilla 66251

Signed-off-by: Shirish Pargaonkar <spargaonkar@suse.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-12-27 15:14:45 -06:00
Sachin Prabhu 750b8de6c4 cifs: We do not drop reference to tlink in CIFSCheckMFSymlink()
When we obtain tcon from cifs_sb, we use cifs_sb_tlink() to first obtain
tlink which also grabs a reference to it. We do not drop this reference
to tlink once we are done with the call.

The patch fixes this issue by instead passing tcon as a parameter and
avoids having to obtain a reference to the tlink. A lookup for the tcon
is already made in the calling functions and this way we avoid having to
re-run the lookup. This is also consistent with the argument list for
other similar calls for M-F symlinks.

We should also return an ENOSYS when we do not find a protocol specific
function to lookup the MF Symlink data.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-12-27 15:14:44 -06:00
Steve French ebcc943c11 Add missing end of line termination to some cifs messages
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Gregor Beck <gbeck@sernet.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2013-12-27 15:14:44 -06:00
Linus Torvalds f41bfc9423 A collection of bug fixes destined for stable and some printk cleanups
and a patch so that instead of BUG'ing we use the ext4_error()
 framework to mark the file system is corrupted.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABCAAGBQJSuSZnAAoJENNvdpvBGATwp/8P/R/VjV1O4IhDmRxbc7mNAkoB
 Mfh2Utnk9daGdMpSMhzWW6m3oyohA0ICledBusQ3ax6Ymg8jIcmGwm3rJ8gAXvR1
 4g0rQ1nw3JEGROId58FnKB3fsEmOPlt4T/LKL4boY6BfER4yu1htH0zSKBuKqykt
 feH9dMiaR1KMQ613eWY6GEonYaP8+nI1GxEfvrymInxznDPVuaLgR4oBMAmR8R76
 9vfJfFHYjbk1wQ5UEv94tic8Hi055PGCRfsLc79QwxMr5KyKz+NydDUIjgKjP9pu
 9sz8iuV79M5/hUguZY7HH9Xd0byZ+jPuNrpkrDqSNZYuArfIcsXKZM/dm7HOgFGQ
 dQzf9S/kBzJvcSHuUchhS2cm6kxCsHaqo16Fxs5kP3TmB3TrVr7EV6uBS4cm53PJ
 x6IdAORhbURfuJCRQOi/TDNUrb+ZHvIx7Gc1ujizczC3An7QurfYo7XY/rWfdj41
 eIVy0+1gqvWJsbXGInni1hKbXMU3yTJ0MqQm05A7MW/G2G6eIgEVpz8MElm33jEE
 VvC6KyZxpviRYPUmxxTbSa1vl0UG1rZdZXslgmlSyY1yItVmyTCIAG217JOTyhTX
 Ae1aZEgzLYh6dQAwweme79WF4WsBPP28TOmW2xoOH7t04gMG0t+9b/RbUDtTPItc
 HXNmIlFP9CULIQ1c2Cvh
 =KPNa
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A collection of bug fixes destined for stable and some printk cleanups
  and a patch so that instead of BUG'ing we use the ext4_error()
  framework to mark the file system is corrupted"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add explicit casts when masking cluster sizes
  ext4: fix deadlock when writing in ENOSPC conditions
  jbd2: rename obsoleted msg JBD->JBD2
  jbd2: revise KERN_EMERG error messages
  jbd2: don't BUG but return ENOSPC if a handle runs out of space
  ext4: Do not reserve clusters when fs doesn't support extents
  ext4: fix del_timer() misuse for ->s_err_report
  ext4: check for overlapping extents in ext4_valid_extent_entries()
  ext4: fix use-after-free in ext4_mb_new_blocks
  ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
2013-12-26 09:26:12 -08:00
Linus Torvalds dc0a6b4fee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 fix from Jan Kara:
 "One simple fix of oops in ext2 which was recently hit by Christoph"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
2013-12-23 17:24:38 -08:00
Linus Torvalds a8472b4bb1 Merge git://git.kvack.org/~bcrl/aio-next
Pull AIO leak fixes from Ben LaHaise:
 "I've put these two patches plus Linus's change through a round of
  tests, and it passes millions of iterations of the aio numa
  migratepage test, as well as a number of repetitions of a few simple
  read and write tests.

  The first patch fixes the memory leak Kent introduced, while the
  second patch makes aio_migratepage() much more paranoid and robust"

* git://git.kvack.org/~bcrl/aio-next:
  aio/migratepages: make aio migrate pages sane
  aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
2013-12-22 11:03:49 -08:00
Linus Torvalds 3dc9acb676 aio: clean up and fix aio_setup_ring page mapping
Since commit 36bc08cc01 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.

However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them.  And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.

Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).

So clean it all up, making it much more straightforward.  Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7da ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.

Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-22 11:03:08 -08:00
Benjamin LaHaise 8e321fefb0 aio/migratepages: make aio migrate pages sane
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-21 17:56:08 -05:00
Benjamin LaHaise 1881686f84 aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
e34ecee2ae reworked the percpu reference
counting to correct a bug trinity found.  Unfortunately, the change lead
to kioctxes being leaked because there was no final reference count to
put.  Add that reference count back in to fix things.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2013-12-21 15:57:09 -05:00
Linus Torvalds a6ddeee32d xfs: bugfixes for 3.13-rc5
- fix memory leak in xfs_dir2_node_removename
 - fix quota assertion in xfs_setattr_size
 - fix quota assertions in xfs_qm_vop_create_dqattach
 - fix for hang when disabling group and project quotas before
   disabling user quotas
 - fix Dave Chinner's email address in MAINTAINERS
 - fix for file allocation alignment
 - fix for assertion in xfs_buf_stale by removing xfsbdstrat
 - fix for alignment with swalloc mount option
 - fix for "retry forever" semantics on IO errors
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSs0PZAAoJENaLyazVq6ZOELgP/Rcx5JdjfCdvZZ7HFfzabLU6
 TOpyEpc0TJso8C92+UNZJUZWNdToEn/v1VRh6dQ+cCz3RxQfOeOKVKXU1XkCBRQO
 JxW7Pucb+SRoVf+uv6qZCCJUO1oY6JByZ8+9GuBGWK5Ul2ByxTPI50Et0Qy4wM3z
 cDvQVyjtA5+63ToUS0sR8yBSKK+8c9SkjVkdLqa+AoFJHYC+meNrZ0J1PRV2ILWu
 bFJtKFe/tO4jj/UJ1uj6ZjvVQ0jm9JH1ZE4m3tbjPcDCTHyxHu5vSBVSlPO4WbAb
 Tfaj4eB7rQy05yno2/mAjn2koaqTSg1cP5V14TMP1GzBQUpwQDAWsNGkorXPfRIn
 Xsrznxk33fTCTqVSkSnVsXKZhizzPydyVCcvf00YJssYh9IEjVdWVpxedLFVJDmO
 jatsMaEAe7Z8avtah6u5vDGTQCEPQjhHPEqhW/EUfCNG1uK6DjyMG4dDsCMufJ7N
 Ze646oXD6zd45hSPQxMV1r8ZvlQoubUgctOBNqs/nDhOblRQ7MRqkRHhPRvvzsBG
 ffVB145l5v1cud0IcpIbfWPtosnPAvoqYS+qglkXkmXmU7rk0APePDYP7XLh4+qy
 8ROkJQ0rsgmC2cyC/fmwtwWQCMCRUrI9YB2X1zRiBS6TwwATP2uIomtT7GwAfK4+
 AmCwxwy6XPMhUd3xn3Vx
 =32uU
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 "This contains fixes for some asserts
   related to project quotas, a memory leak, a hang when disabling group or
   project quotas before disabling user quotas, Dave's email address, several
   fixes for the alignment of file allocation to stripe unit/width geometry, a
   fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
   metadata writeback in the face of IO errors.

   Details:
   - fix memory leak in xfs_dir2_node_removename
   - fix quota assertion in xfs_setattr_size
   - fix quota assertions in xfs_qm_vop_create_dqattach
   - fix for hang when disabling group and project quotas before
     disabling user quotas
   - fix Dave Chinner's email address in MAINTAINERS
   - fix for file allocation alignment
   - fix for assertion in xfs_buf_stale by removing xfsbdstrat
   - fix for alignment with swalloc mount option
   - fix for "retry forever" semantics on IO errors"

* tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: abort metadata writeback on permanent errors
  xfs: swalloc doesn't align allocations properly
  xfs: remove xfsbdstrat error
  xfs: align initial file allocations correctly
  MAINTAINERS: fix incorrect mail address of XFS maintainer
  xfs: fix infinite loop by detaching the group/project hints from user dquot
  xfs: fix assertion failure at xfs_setattr_nonsize
  xfs: fix false assertion at xfs_qm_vop_create_dqattach
  xfs: fix memory leak in xfs_dir2_node_removename
2013-12-20 15:48:45 -08:00
Luck, Tony df36ac1bc2 pstore: Don't allow high traffic options on fragile devices
Some pstore backing devices use on board flash as persistent
storage. These have limited numbers of write cycles so it
is a poor idea to use them from high frequency operations.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 13:12:01 -08:00