Commit Graph

35129 Commits

Author SHA1 Message Date
Wang Shilong f05c474688 Btrfs: fix memory leaks on walking backrefs failure
When walking backrefs, we may iterate every inode's extent
and add/merge them into ulist, and the caller will free memory
from ulist.

However, if we fail to allocate inode's extents element
memory or ulist_add() fail to allocate memory, we won't
add allocated memory into ulist, and the caller won't
free some allocated memory thus memory leaks happen.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:26 -08:00
Filipe David Borba Manana bf54f412f0 Btrfs: fix send file hole detection leading to data corruption
There was a case where file hole detection was incorrect and it would
cause an incremental send to override a section of a file with zeroes.

This happened in the case where between the last leaf we processed which
contained a file extent item for our current inode and the leaf we're
currently are at (and has a file extent item for our current inode) there
are only leafs containing exclusively file extent items for our current
inode, and none of them was updated since the previous send operation.
The file hole detection code would incorrectly consider the file range
covered by these leafs as a hole.

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:25 -08:00
Wang Shilong bca1a29003 Btrfs: add a reschedule point in btrfs_find_all_roots()
I can easily trigger the following warnings when enabling quota
in my virtual machine(running Opensuse), Steps are firstly creating
a subvolume full of fragment extents, and then create many snapshots
(500 in my test case).

[ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970]

[ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000
[ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0
[ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs]
[ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0
[ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c
[ 2362.809054]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0
[ 2362.809099] Stack:
[ 2362.809100]  00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001
[ 2362.809105]  99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001
[ 2362.809109]  f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023
[ 2362.809113] Call Trace:
[ 2362.809136]  [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs]
[ 2362.809156]  [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs]
[ 2362.809174]  [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs]
[ 2362.809180]  [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40
[ 2362.809199]  [<fa3648df>] worker_loop+0xff/0x470 [btrfs]
[ 2362.809204]  [<c027a88a>] ? __wake_up_locked+0x1a/0x20
[ 2362.809221]  [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs]
[ 2362.809225]  [<c025ebbc>] kthread+0x9c/0xb0
[ 2362.809229]  [<c06b487b>] ret_from_kernel_thread+0x1b/0x30
[ 2362.809233]  [<c025eb20>] ? kthread_create_on_node+0x110/0x110

By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer
hit these warnings.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:25 -08:00
Filipe David Borba Manana 7fdd29d02e Btrfs: make send's file extent item search more efficient
Instead of looking for a file extent item, process it, release the path
and do a btree search for the next file extent item, just process all
file extent items in a leaf without intermediate btree searches. This way
we save cpu and we're not blocking other tasks or affecting concurrency on
the btree, because send's paths use the commit root and skip btree node/leaf
locking.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:24 -08:00
Wang Shilong 95def2ede1 Btrfs: fix to catch all errors when resolving indirect ref
We can only tolerate ENOENT here, for other errors, we should
return directly.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:23 -08:00
Wang Shilong 538f72cdf0 Btrfs: fix protection between walking backrefs and root deletion
There is a race condition between resolving indirect ref and root deletion,
and we should gurantee that root can not be destroyed to avoid accessing
broken tree here.

Here we fix it by holding @subvol_srcu, and we will release it as soon
as we have held root node lock.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:23 -08:00
Gui Hecheng 3c9665df0c btrfs: fix warning while merging two adjacent extents
When we have two adjacent extents in relink_extent_backref,
we try to merge them. When we use btrfs_search_slot to locate the
slot for the current extent, we shouldn't set "ins_len = 1",
because we will merge it into the previous extent rather than
insert a new item. Otherwise, we may happen to create a new leaf
in btrfs_search_slot and path->slot[0] will be 0. Then we try to
fetch the previous item using "path->slots[0]--", and it will cause
a warning as follows:

	[  145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
	[  145.713387] btrfs bad mapping eb start 5337088 len 4096, wanted 167772306 8
	...
	[  145.713462]  [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
	[  145.713476]  [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
	[  145.713485]  [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
	[  145.713498]  [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
	[  145.713508]  [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80

I encounter this warning when running defrag having mkfs.btrfs
with option -M. At the same time there are read/writes & snapshots
running at background.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:22 -08:00
Filipe David Borba Manana 9f03740a95 Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.

This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:

  WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
  Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
  CPU: 1 PID: 5705 Comm: btrfs Tainted: G           O 3.13.0-rc7-fdm-btrfs-next-18+ #3
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
  [ 5381.660441]  00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
  [ 5381.660447]  0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
  [ 5381.660452]  0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
  [ 5381.660457] Call Trace:
  [ 5381.660464]  [<ffffffff81777434>] dump_stack+0x4e/0x68
  [ 5381.660471]  [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
  [ 5381.660476]  [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
  [ 5381.660480]  [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
  [ 5381.660487]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
  [ 5381.660491]  [<ffffffff811430e8>] ? free_one_page+0x98/0x440
  [ 5381.660495]  [<ffffffff8108313f>] ? local_clock+0x4f/0x60
  [ 5381.660502]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
  [ 5381.660508]  [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
  [ 5381.660515]  [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
  [ 5381.660520]  [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
  [ 5381.660524]  [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
  [ 5381.660530]  [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
  [ 5381.660536]  [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
  [ 5381.660560]  [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
  [ 5381.660564]  [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
  [ 5381.660569]  [<ffffffff811580ef>] krealloc+0x6f/0xb0
  [ 5381.660586]  [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
  [ 5381.660601]  [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
  [ 5381.660615]  [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
  [ 5381.660628]  [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]

Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:

  $ mkfs.btrfs -f /dev/sdb3
  $ mount /dev/sdb3 /mnt/btrfs
  $ mkdir -p /mnt/btrfs/a/b/c/d
  $ mkdir /mnt/btrfs/a/b/c2
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
  $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
  $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
  $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
  $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send

The structure of the filesystem when the first snapshot is taken is:

	 .                       (ino 256)
	 |-- a                   (ino 257)
	     |-- b               (ino 258)
	         |-- c           (ino 259)
	         |   |-- d       (ino 260)
                 |
	         |-- c2          (ino 261)

And its structure when the second snapshot is taken is:

	 .                       (ino 256)
	 |-- a                   (ino 257)
	     |-- b               (ino 258)
	         |-- c2          (ino 261)
	             |-- d2      (ino 260)
	                 |-- cc  (ino 259)

Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.

A test case for xfstests, with a more complex scenario, will follow soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:22 -08:00
Jan Kara 8581679424 fanotify: Fix use after free for permission events
Currently struct fanotify_event_info has been destroyed immediately
after reporting its contents to userspace. However that is wrong for
permission events because those need to stay around until userspace
provides response which is filled back in fanotify_event_info. So change
to code to free permission events only after we have got the response
from userspace.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:17 +01:00
Jan Kara 83c0e1b442 fsnotify: Do not return merged event from fsnotify_add_notify_event()
The event returned from fsnotify_add_notify_event() cannot ever be used
safely as the event may be freed by the time the function returns (after
dropping notification_mutex). So change the prototype to just return
whether the event was added or merged into some existing event.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:10 +01:00
Jan Kara 13116dfd13 fanotify: Fix use after free in mask checking
We cannot use the event structure returned from
fsnotify_add_notify_event() because that event can be freed by the time
that function returns. Use the mask argument passed into the event
handler directly instead. This also fixes a possible problem when we
could unnecessarily wait for permission response for a normal fanotify
event which got merged with a permission event.

We also disallow merging of permission event with any other event so
that we know the permission event which we just created is the one on
which we should wait for permission response.

Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-01-29 13:57:04 +01:00
Linus Torvalds 0e47c969c6 MTD updates for 3.14:
- Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get
    David's "ack" for this; I'm sure he approves, but he's been pretty silent
    lately)
  - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver
  - Last (?) round of pxa3xx improvements for supporting Armada 370/XP
  - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_*, etc.)
  - Quad read mode support for SPI NOR driver (m25p80)
  - Update Davinci NAND driver to prepare for use on new platforms
  - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending
  - Miscellaneous NAND device support (new IDs)
  - Add READ RETRY support for Micron MLC NAND
  - Support new GPMI NAND ECC layout device-tree binding
  - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJS50wMAAoJEFySrpd9RFgt1I0P/3fxzmHMmTRPjNRmtVqE5buW
 L3CjI1sPypA5PgCjjwABQdiw8WztBt6KVlMv5Gs6gaoi/WYAyAWz40PfwuBWFklD
 L8+cuKuuslsFfqh+gdkrBkNKnrFfMJpTDx57E5+dciInmWWB1w14sHMAYk5A/ZHK
 MBfXCcgfIECeQp1zSKmBtQxkhqn20hOS232gsCRWu2NWWu4UCzEcRLHGCBP1qqi3
 6joQhrICiQJwH09D+I9YfyS/oisM+75Df+79xephcjMaJYAAE3tLVrKM84CBso+R
 R/wjOvX+xFVyvf3n+CftxbmbFVXtRAWqCEEHu2yKffiF9oipp9xOm6gqJb9SYX/7
 n54fexx5cM1AE8OeDCxgbVCNfgoqDS6hk03d5oVMRmASShr0Ye4irG77Dy/KTMGC
 UdB2gb5HoKPC2N2IWWRlpwRoyLR9Xhf/+iCHjac0kB7oYQ0IdMyC0d4h7Prp5r/4
 zaz/pBNNr1lL8DOFrIog02U1DL7jOBtMmlaCUnfrPkXOiwG04Inkfy7Gfug0uTIh
 o1JhwCWmrY/CLYmpMqKFK9V4gr72CZTGIjMeNpb2NUAM0XlRbf3AQ8x7P7Q9UjK6
 ARZC/3lPrkSE8BDgSP8cT/oXB5zsIAU0jJbu3yIixD/v9WwMyKjoS4E8crl7qVB4
 KDAzQNR3rjwmE9lOjtsx
 =3nT9
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 - Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get
   David's "ack" for this; I'm sure he approves, but he's been pretty silent
   lately)
 - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver
 - Last (?) round of pxa3xx improvements for supporting Armada 370/XP
 - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_*, etc.)
 - Quad read mode support for SPI NOR driver (m25p80)
 - Update Davinci NAND driver to prepare for use on new platforms
 - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending
 - Miscellaneous NAND device support (new IDs)
 - Add READ RETRY support for Micron MLC NAND
 - Support new GPMI NAND ECC layout device-tree binding
 - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA

* tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd: (151 commits)
  mtd: gpmi: add sanity check when mapping DMA for read_buf/write_buf
  mtd: gpmi: allocate a proper buffer for non ECC read/write
  mtd: m25p80: Set rx_nbits for Quad SPI transfers
  mtd: m25p80: Enable Quad SPI read transfers for s25fl512s
  mtd: s3c2410: Merge plat/regs-nand.h into s3c2410.c
  mtd: mtdram: add missing 'const'
  mtd: m25p80: assign default read command
  mtd: nuc900_nand: remove redundant return value check of platform_get_resource()
  mtd: plat_nand: remove redundant return value check of platform_get_resource()
  mtd: nand: add Intel manufacturer ID
  mtd: nand: add SanDisk manufacturer ID
  mtd: nand: add support for Samsung K9LCG08U0B
  mtd: nand: pxa3xx: Add support for 2048 bytes page size devices
  mtd: m25p80: Use OPCODE_QUAD_READ_4B for 4-byte addressing
  mtd: nand: don't use {read,write}_buf for 8-bit transfers
  mtd: nand: use __packed shorthand
  mtd: nand: support Micron READ RETRY
  mtd: nand: add generic READ RETRY support
  mtd: nand: add ONFI vendor block for Micron
  mtd: nand: localize ECC failures per page
  ...
2014-01-28 18:56:37 -08:00
Linus Torvalds f1499382f1 xfs: update #2 for v3.14-rc1
- allow logical sector sized direct io on 'advanced format' 4k/512 disk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJS6C5lAAoJENaLyazVq6ZOFFMP/Ag3IzyGv9zQ1I1ebqxdGNmU
 8bNVvurN+Az1ISgXkMdcbqNgpVxcWaAcpbJwxWsRBaKBxtBmBWa0TF33XEmrFdmO
 Qqw3qLSsPzIDP5l518/bpn1Q0XhMWO2zss9HtHogw/7WBkB5qQYn94DAwIF1Zrga
 cimpuiVv7u73BnlMJeQl+mqONkuEUl9Oc9lOyDP7lqbAlp3mBDylTs0q99kug8fF
 LSQ+f+06PFKshPzTtcZHhqwMtK7Pd+wPp3B3i0iGvv1GcWWL1adM5HfWPLLQ/85l
 CvdTnKnIWpNvmaEbPtpB+kuz7V3CtnLFIBM1oRiYZoNRNEUG3SkkqH+xSkXY+Wfr
 Wiuplt50vHUvn1tiL6n2GE0QAyTNhcy2RiWzpkIim6S2LUXCmdOKvT9hEEs5ymVl
 Y+VSXwdPEbCHSfhqkg+PXUbWxk+WTkDe9rOMFK4VcTjDuxiWUAiFgVpEHrxjPV4A
 rFPwNgfUpaRMou5SzTZHLuM8YQ75DckDH+O3XpWD4boEfeTGeV3LhlRzLvIkL0I2
 8VB24kV9TPUj+rphh9AckdU3kfN1wIb7ZOjwr008I5udfgPwjoj+kSgnd+k2yPtr
 r/ryHTV+RchbsiYEdriibSgUOtJNdNWnboba1DEDZDSvqALsq5eKWmpm3InE85Ix
 2bN0SdjsTphp/Mc1Mllg
 =c0tN
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull second xfs update from Ben Myers:
 "Allow logical sector sized direct io on 'advanced format' 4k/512 disk"

* tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: allow logical-sector sized O_DIRECT
  xfs: rename xfs_buftarg structure members
  xfs: clean up xfs_buftarg
2014-01-28 18:21:22 -08:00
Linus Torvalds 4db658ea0c ceph: Fix up after semantic merge conflict
The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2e ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957 "fs: add generic
xattr_acl handlers", 5bf3258fd2 "fs: make posix_acl_chmod more useful"
and 37bc15392a "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict.  And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way.  And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too.  But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this.  Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-28 18:06:18 -08:00
Anand Jain 2365dd3ca0 btrfs: undo sysfs when open_ctree() fails
reproducer:
mkfs.btrfs -f /dev/sdb &&\
mount /dev/sdb /btrfs &&\
btrfs dev add -f /dev/sdc /btrfs &&\
umount /btrfs &&\
wipefs -a /dev/sdc &&\
mount -o degraded /dev/sdb /btrfs
//above mount fails so try with RO
mount -o degraded,ro /dev/sdb /btrfs

------
sysfs: cannot create duplicate filename '/fs/btrfs/3f48c79e-5ed0-4e87-b189-86e749e503f4'
::

dump_stack+0x49/0x5e
warn_slowpath_common+0x87/0xb0
warn_slowpath_fmt+0x41/0x50
strlcat+0x69/0x80
sysfs_warn_dup+0x87/0xa0
sysfs_add_one+0x40/0x50
create_dir+0x76/0xc0
sysfs_create_dir_ns+0x7a/0xc0
kobject_add_internal+0xad/0x220
kobject_add_varg+0x38/0x60
kobject_init_and_add+0x53/0x70
mutex_lock+0x11/0x40
__free_pages+0x25/0x30
free_pages+0x41/0x50
selinux_sb_copy_data+0x14e/0x1e0
mount_fs+0x3e/0x1a0
vfs_kern_mount+0x71/0x120
do_mount+0x3f7/0x980
SyS_mount+0x8b/0xe0
system_call_fastpath+0x16/0x1b
------

further 'modprobe -r btrfs' fails as well

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:44 -08:00
Filipe David Borba Manana f74b86d855 Btrfs: fix snprintf usage by send's gen_unique_name
The buffer size argument passed to snprintf must account for the
trailing null byte added by snprintf, and it returns a value >= then
sizeof(buffer) when the string can't fit in the buffer.

Since our buffer has a size of 64 characters, and the maximum orphan
name we can generate is 63 characters wide, we must pass 64 as the
buffer size to snprintf, and not 63.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:43 -08:00
Justin Maggard c41570c9d2 btrfs: fix defrag 32-bit integer overflow
When defragging a very large file, the cluster variable can wrap its 32-bit
signed int type and become negative, which eventually gets passed to
btrfs_force_ra() as a very large unsigned long value.  On 32-bit platforms,
this eventually results in an Oops from the SLAB allocator.

Change the cluster and max_cluster signed int variables to unsigned long to
match the readahead functions.  This also allows the min() comparison in
btrfs_defrag_file() to work as intended.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:43 -08:00
David Sterba c736c095de btrfs: sysfs: list the NO_HOLES feature
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:42 -08:00
David Sterba 66b4bbd4f5 btrfs: sysfs: don't show reserved incompat feature
The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit
is only reserved, no point to list it in sysfs.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:41 -08:00
David Sterba bd60ea0fe9 btrfs: call permission checks earlier in ioctls and return EPERM
The owner and capability checks in IOC_SUBVOL_SETFLAGS and
SET_RECEIVED_SUBVOL should be called before any other checks are done.

Also unify the error code to EPERM.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:41 -08:00
David Sterba d024206133 btrfs: restrict snapshotting to own subvolumes
Currently, any user can snapshot any subvolume if the path is accessible and
thus indirectly create and keep files he does not own under his direcotries.
This is not possible with traditional directories.

In security context, a user can snapshot root filesystem and pin any
potentially buggy binaries, even if the updates are applied.

All the snapshots are visible to the administrator, so it's possible to
verify if there are suspicious snapshots.

Another more practical problem is that any user can pin the space used
by eg. root and cause ENOSPC.

Original report:
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:40 -08:00
Miao Xie 89d4346a36 Btrfs: fix wrong block group in trace during the free space allocation
We allocate the free space from the former block group, not the current
one, so should use the former one to output the trace information.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:40 -08:00
Miao Xie 215a63d139 Btrfs: cleanup the code of used_block_group in find_free_extent()
used_block_group is just used for the space cluster which doesn't
belong to the current block group, the other place needn't use it.
Or the logic of code seems unclear.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:39 -08:00
Miao Xie 920e4a58d2 Btrfs: cleanup the redundant code for the block group allocation and init
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:38 -08:00
Miao Xie 26b47ff65b Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss
It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:38 -08:00
Wang Shilong ffcfaf8179 Btrfs: fix wrong search path initialization before searching tree root
To search tree root without transaction protection, we should neither search commit
root nor skip locking here, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:37 -08:00
Miao Xie 23c671a588 Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)

This patch improves the above problem by flushing left dirty pages aggressively.

Test Environment:
CPU:		2CPU * 2Cores
Memory:		4GB
Partition:	20GB(HDD)

Test Command:
 # sysbench --num-threads=8 --test=fileio --file-num=1 \
 > --file-total-size=8G --file-block-size=32768 \
 > --file-io-mode=sync --file-fsync-freq=100 \
 > --file-fsync-end=no --max-requests=10000 \
 > --file-test-mode=rndwr run

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:37 -08:00
Wang Shilong 2c21b4d733 Btrfs: fix transaction abortion when remounting btrfs from RW to RO
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt -o flushoncommit
 # dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
 # mount /dev/sda8 /mnt -o remount, ro

When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.

Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:36 -08:00
Filipe David Borba Manana e4355f34ef Btrfs: faster file extent item search in clone ioctl
When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:35 -08:00
Liu Bo 1a4319cc3c Btrfs: fix extent state leak on transaction abortion
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work.  After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:35 -08:00
Qu Wenruo 078025347c btrfs: Cleanup the btrfs_parse_options for remount.
Since remount will pending the new mount options to the original mount
options, which will make btrfs_parse_options check the old options then
new options, causing some stupid output like "enabling XXX" following by
"disable XXX".

This patch will add extra check before every btrfs_info to skip the
output from old options checking.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:34 -08:00
Qu Wenruo 3818aea275 btrfs: Add noinode_cache mount option
Add noinode_cache mount option for btrfs.

Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.

To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.

Also noinode_cache mount option will not stop the caching_kthread.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:33 -08:00
Wang Shilong ade2e0b3ee Btrfs: fix to search previous metadata extent item since skinny metadata
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.

To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:33 -08:00
Wang Shilong 7c76edb77c Btrfs: fix missing skinny metadata check in scrub_stripe()
Check if we support skinny metadata firstly and fix to use
right type to search.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:32 -08:00
Filipe David Borba Manana 28e5dd8f35 Btrfs: fix send to not send non-aligned clone operations
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.

Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.

The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=`mktemp -d`

  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $tmp
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root

  rm -f $seqres.full

  _scratch_mkfs >/dev/null 2>&1
  _scratch_mount

  $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
      _filter_scratch

  $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
  $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch

  $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
      _filter_scratch

  $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
  $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $tmp/2.snap 2>&1 | _filter_scratch

  md5sum $SCRATCH_MNT/foo | _filter_scratch
  md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  _scratch_unmount
  _check_btrfs_filesystem $SCRATCH_DEV
  _scratch_mkfs >/dev/null 2>&1
  _scratch_mount

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
  md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch

  $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  _scratch_unmount
  _check_btrfs_filesystem $SCRATCH_DEV

  status=0
  exit

The tests expected output is:

  QA output created by 025
  FSSync 'SCRATCH_MNT'
  FSSync 'SCRATCH_MNT'
  wrote 2978/2978 bytes at offset 1482752
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  FSSync 'SCRATCH_MNT'
  Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
  FSSync 'SCRATCH_MNT'
  Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
  At subvol SCRATCH_MNT/mysnap1
  At subvol SCRATCH_MNT/mysnap2
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/foo
  42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo
  At subvol mysnap1
  42b6369eae2a8725c1aacc0440e597aa  SCRATCH_MNT/mysnap1/foo
  At snapshot mysnap2
  129b8eaee8d3c2bcad49bec596591cb3  SCRATCH_MNT/mysnap2/foo

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:32 -08:00
Filipe David Borba Manana 14a958e678 Btrfs: fix btrfs boot when compiled as built-in
After the change titled "Btrfs: add support for inode properties", if
btrfs was built-in the kernel (i.e. not as a module), it would cause a
kernel panic, as reported recently by Fengguang:

[    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
[    2.028684] PGD 0
[    2.028684] Oops: 0000 [#1] SMP
[    2.028684] Modules linked in:
[    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
[    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
[    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
[    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
[    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
[    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
[    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
[    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
[    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
[    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
[    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
[    2.028684] Stack:
[    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
[    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
[    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
[    2.028684] Call Trace:
[    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
[    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
[    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
[    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
[    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
[    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
[    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
[    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109

The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
part of the properties initialization routine), the libcrc32c is not yet initialized,
so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
panic on boot.

The approach to fix this is to use crypto component directly to use its crc32c (which
is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
doing as well, it uses crypto directly to get crc32c functionality.

Verified this works both when btrfs is built-in and when it's loadable kernel module.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:31 -08:00
Filipe David Borba Manana c57c2b3ed2 Btrfs: unlock inodes in correct order in clone ioctl
In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:30 -08:00
Wang Shilong f499e40fd9 Btrfs: optimize to remove unnecessary removal with ulist reallocation
Here we are not going to free memory, no need to remove every node
one by one, just init root node here is ok.

Cc:  Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:30 -08:00
Liu Bo de6e820066 Btrfs: release subvolume's block_rsv before transaction commit
We don't have to keep subvolume's block_rsv during transaction commit,
and within transaction commit, we may also need the free space reclaimed
from this block_rsv to process delayed refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:29 -08:00
Miao Xie f1de968376 Btrfs: fix the race between write back and nocow buffered write
When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0

It is caused by the race between the write back and nocow buffered
write:
  Task1				Task2
  __btrfs_buffered_write()
    skip data reservation
    reserve the metadata space
    copy the data
    dirty the pages
    unlock the pages
				write back the pages
				release the data space
   				  becasue there is no
				  noreserve flag
   set the noreserve flag

This patch fixes this problem by unlocking the pages after
the noreserve flag is set.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:28 -08:00
Josef Bacik 7ef81ac86c Btrfs: only process as many file extents as there are refs
The backref walking code will search down to the key it is looking for and then
proceed to walk _all_ of the extents on the file until it hits the end.  This is
suboptimal with large files, we only need to look for as many extents as we have
references for that inode.  I have a testcase that creates a randomly written 4
gig file and before this patch it took 6min 30sec to do the initial send, with
this patch it takes 2min 30sec to do the intial send.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:28 -08:00
Josef Bacik 3a6d75e846 Btrfs: fix qgroup rescan to work with skinny metadata
Could have sworn I fixed this before but apparently not.  This makes us pass
btrfs/022 with skinny metadata enabled.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:27 -08:00
Josef Bacik 580f0a678e Btrfs: fix extent_from_logical to deal with skinny metadata
I don't think this is an issue and I've not seen it in practice but
extent_from_logical will fail to find a skinny extent because it uses
btrfs_previous_item and gives it the normal extent item type.  This is just not
a place to use btrfs_previous_item since we care about either normal extents or
skinny extents, so open code btrfs_previous_item to properly check.  This would
only affect metadata and the only place this is used for metadata is scrub and
I'm pretty sure it's just for printing stuff out, not actually doing any work so
hopefully it was never a problem other than a cosmetic one.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:27 -08:00
Josef Bacik 0a2b2a844a Btrfs: throttle delayed refs better
On one of our gluster clusters we noticed some pretty big lag spikes.  This
turned out to be because our transaction commit was taking like 3 minutes to
complete.  This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb.  So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run.  So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time.  This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk.  This makes our transaction commits take much less time to run.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:26 -08:00
Josef Bacik d7df2c796d Btrfs: attach delayed ref updates to delayed ref heads
Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads.  When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention.  This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.

So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head.  Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.

The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later.  For now this passes all of xfstests and my overnight stress tests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:25 -08:00
Josef Bacik 5039eddc19 Btrfs: make fsync latency less sucky
Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance.  If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such.  Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people.  With this patch we get much
smoother fsync performance with large amounts of metadata.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:25 -08:00
Filipe David Borba Manana 63541927c8 Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."

Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).

This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.

The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.

Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.

Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.

Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.

* Without properties (test 1)

                    file creation time        ls -lha time
10 000 files              3.49                   0.76
100 000 files            47.19                   8.37
1 000 000 files         518.51                 107.06

* With 1 property (compression property set to lzo - test 2)

                    file creation time        ls -lha time
10 000 files              3.63                    0.93
100 000 files            48.56                    9.74
1 000 000 files         537.72                  125.11

* With 4 properties (test 3)

                    file creation time        ls -lha time
10 000 files              3.94                    1.20
100 000 files            52.14                   11.48
1 000 000 files         572.70                  142.13

* With 10 properties (test 4)

                    file creation time        ls -lha time
10 000 files              4.61                    1.35
100 000 files            58.86                   13.83
1 000 000 files         656.01                  177.61

The increased latencies with properties are essencialy because of:

*) When creating an inode, we now synchronously write 1 more item
   (an xattr item) for each property inherited from the parent dir
   (or subvolume). This could be done in an asynchronous way such
   as we do for dir intex items (delayed-inode.c), which could help
   reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular
   test each xattr item uses 75 bytes of leaf space in the fs tree.
   This could be less by using a new item for xattr items, instead of
   the current btrfs_dir_item, since we could cut the 'location' and
   'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
   total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
  #!/usr/bin/perl -w

  use strict;
  use Time::HiRes qw(time);
  use constant NUM_FILES => 10_000;
  use constant FILE_SIZES => (64 * 1024);
  use constant DEV => '/dev/sdb4';
  use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
  use constant TEST_DIR => (MNT_POINT . '/testdir');

  system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

  # following line for testing without properties
  #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

  # following 2 lines for testing with properties
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
  system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

  system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
  my ($t1, $t2);

  $t1 = time();
  for (my $i = 1; $i <= NUM_FILES; $i++) {
      my $p = TEST_DIR . '/file_' . $i;
      open(my $f, '>', $p) or die "Error opening file!";
      $f->autoflush(1);
      for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
          print $f ('A' x 4096) or die "Error writing to file!";
      }
      close($f);
  }
  $t2 = time();
  print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";
  system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

  $t1 = time();
  system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
  $t2 = time();
  print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
  system("umount", DEV) == 0 or die "umount failed!";

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:24 -08:00
Filipe David Borba Manana 1acae57b16 Btrfs: faster file extent item replace operations
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

  sysbench --test=fileio --file-num=4096 --file-total-size=8G \
     --file-test-mode=[rndwr|seqwr] --num-threads=512 \
     --file-block-size=8192 \ --max-requests=1000000 \
     --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:23 -08:00
Wang Shilong 90515e7f5d Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot()
We may return early in btrfs_drop_snapshot(), we shouldn't
call btrfs_std_err() for this case, fix it.

Cc: stable@vger.kernel.org
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:23 -08:00
Wang Shilong 8e56338d7d Btrfs: remove unnecessary transaction commit before send
We will finish orphan cleanups during snapshot, so we don't
have to commit transaction here.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:22 -08:00
Wang Shilong 18f687d538 Btrfs: fix protection between send and root deletion
We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.

1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.

2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.

Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.

Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:

1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:21 -08:00
Wang Shilong 896c14f97f Btrfs: fix wrong send_in_progress accounting
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt
 # btrfs sub snapshot -r /mnt /mnt/snap1
 # btrfs sub snapshot -r /mnt /mnt/snap2
 # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
 # dmesg

The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:21 -08:00
Qu Wenruo a88998f291 btrfs: Add treelog mount option.
Add treelog mount option to enable tree log with
remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:20 -08:00
Qu Wenruo d399167d88 btrfs: Add datasum mount option.
Add datasum mount option to enable checksum with
remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:20 -08:00
Qu Wenruo a258af7a3e btrfs: Add datacow mount option.
Add datacow mount option to enable copy-on-write with
remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:19 -08:00
Qu Wenruo bd0330ad21 btrfs: Add acl mount option.
Add acl mount option to enable acl with remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:19 -08:00
Qu Wenruo 2c9ee85671 btrfs: Add noflushoncommit mount option.
Add noflushoncommit mount option to disable flush on commit with
remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:18 -08:00
Qu Wenruo 5303629343 btrfs: Add noenospc_debug mount option.
Add noenospc_debug mount option to disable ENOSPC debug with
remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:17 -08:00
Qu Wenruo e07a2ade44 btrfs: Add nodiscard mount option.
Add nodiscard mount option to disable discard with remount option.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:17 -08:00
Qu Wenruo fc0ca9af18 btrfs: Add noautodefrag mount option.
Btrfs has autodefrag mount option but no pairing noautodefrag option,
which makes it impossible to disable autodefrag without umount.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:16 -08:00
Qu Wenruo 842bef5891 btrfs: Add "barrier" option to support "-o remount,barrier"
Btrfs can be remounted without barrier, but there is no "barrier" option
so nobody can remount btrfs back with barrier on. Only umount and
mount again can re-enable barrier.(Quite awkward)

Also the mount options in the document is also changed slightly for the
further pairing options changes.

Reported-by: Daniel Blueman <daniel@quora.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:16 -08:00
Wang Shilong e8117c26b2 Btrfs: only fua the first superblock when writting supers
We only intent to fua the first superblock in every device from
comments, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:15 -08:00
Liu Bo 17504584f5 Btrfs: return free space to global_rsv as much as possible
@full is not protected within global_rsv.lock, so we may think global_rsv
is already full but in fact it's not, so we miss the opportunity to return
free space to global_rsv directly when we release other block_rsvs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:14 -08:00
Wang Shilong 1708cc5723 Btrfs: fix an oops when we fail to relocate tree blocks
During balance test, we hit an oops:
[ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!

The problem is that if we fail to relocate tree blocks, we should
update backref cache, otherwise, some pending nodes are not updated
while snapshot check @cache->last_trans is within one transaction
and won't update it and then oops happen.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:14 -08:00
Miao Xie e77751aad1 Btrfs: fix the wrong nocow range check
The following warning message was outputed when running the 274th case
of xfstests with nodatacow option:
 BUG: Bad page state in process kswapd0  pfn:1c66f
 page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
 page flags: 0x1000000000100a(error|uptodate|private_2)

It is because the check of nocow range was wrong, we should compare the
start and end position of the extent with the write position to verify
if the write position was in the extent, but the current code just used
the start postion to do the check, so we got the wrong extent and told
the caller that it was a nocow write. And then when we write back the
dirty pages, we found we should cow the extent, but at that time, there
was no space in the fs, we had to the error flag for the page. When
someone reclaimed that page, the above warning outputed. Fix it.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:13 -08:00
Wang Shilong 25e293c2a2 Btrfs: fix an oops when we fail to merge reloc roots
Previously, we will free reloc root memory and then force filesystem
to be readonly. The problem is that there may be another thread commiting
transaction which will try to access freed reloc root during merging reloc
roots process.

To keep consistency snapshots shared space, we should allow snapshot
finished if possible, so here we don't free reloc root memory.

signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:13 -08:00
Wang Shilong dc4103f933 Btrfs: remove unused argument from select_reloc_root()
@nr is no longer used, remove it from select_reloc_root()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:12 -08:00
Filipe David Borba Manana eb653de159 Btrfs: reduce btree node locking duration on item update
If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).

Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).

This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.

The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).

  sysbench --test=fileio --file-num=131072 --file-total-size=8G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Before this change:  49.85Mb/s (average of 5 runs)
After this change:   50.38Mb/s (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:11 -08:00
Wenliang Fan eb8052e015 fs/btrfs: Integer overflow in btrfs_ioctl_resize()
The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
	new_size = old_size + new_size;

Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:11 -08:00
Josef Bacik c9ea7b24ce Btrfs: stop caching thread if extent_commit_sem is contended
We can starve out the transaction commit with a bunch of caching threads all
running at the same time.  This is because we will only drop the
extent_commit_sem if we need_resched(), which isn't likely to happen since we
will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
observed that he could starve out a transaction commit for up to a minute with
32 caching threads all running at once.  This will allow us to drop the
extent_commit_sem to allow the transaction commit to swap the commit_root out
and then all the cachers will start back up. Here is an explanation provided by
Igno

So, just to fill in what happens in this loop:

                                mutex_unlock(&caching_ctl->mutex);
                                cond_resched();
                                goto again;

where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
again:

        again:
                mutex_lock(&caching_ctl->mutex);
                /* need to make sure the commit_root doesn't disappear */
                down_read(&fs_info->extent_commit_sem);

So, if I'm reading the code correct, there can be a fair amount of
concurrency here: there may be multiple 'caching kthreads' per filesystem
active, while there's one fs_info->extent_commit_sem per filesystem
AFAICS.

So, what happens if there are a lot of CPUs all busy holding the
->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
rush to try to release the fs_info->extent_commit_sem, and they'd block in
the down_read() because there's a writer waiting.

So there's a guarantee of forward progress. This should answer akpm's
concern I think.

Thanks,

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:10 -08:00
Miao Xie 67de11769b Btrfs: introduce the delayed inode ref deletion for the single link inode
The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).

This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:09 -08:00
Miao Xie 7cf35d91b4 Btrfs: use flags instead of the bool variants in delayed node
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:08 -08:00
Miao Xie a56dbd8940 Btrfs: remove btrfs_end_transaction_dmeta()
Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
  so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
  workers should not commit the transaction, instead, deal with the delayed
  items as many as possible.

So we can remove btrfs_end_transaction_dmeta()

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:08 -08:00
Miao Xie 0353808cae Btrfs: cleanup code of btrfs_balance_delayed_items()
- move the condition check for wait into a function
- use wait_event_interruptible instead of prepare-schedule-finish process

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:07 -08:00
Miao Xie 4dd466d36a Btrfs: don't run delayed nodes again after all nodes flush
If the number of the delayed items is greater than the upper limit, we will
try to flush all the delayed items. After that, it is unnecessary to run
them again because they are being dealt with by the wokers or the number of
them is less than the lower limit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:06 -08:00
Miao Xie 74c40f925e Btrfs: remove residual code in delayed inode async helper
Before applying the patch
  commit de3cb945db
  title: Btrfs: improve the delayed inode throttling

We need requeue the async work after the current work was done, it
introduced a deadlock problem. So we wrote the code that this patch
removes to avoid the above problem. But after applying the above
patch, the deadlock problem didn't exist. So we should remove that
fix code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:06 -08:00
Frank Holton efe120a067 Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:05 -08:00
Filipe David Borba Manana 5de865eebb Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:

  btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
      --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
      +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
      @@ -1,3 +1,8 @@
       QA output created by 004
       *** test backref walking
      -*** done
      +unexpected output from
      +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
      +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
      +
       ...
       (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
  Ran: btrfs/004
  Failures: btrfs/004
  Failed 1 of 1 tests

But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:

  $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
  inode 405 offset 454656 root 258
  inode 405 offset 454656 root 5

It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:

[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0

These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:

  c8cc634165
  (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)

That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.

This issue has been experienced by several users recently, such as for example:

  http://www.spinics.net/lists/linux-btrfs/msg28574.html

After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:04 -08:00
David Sterba 66ef7d65c3 btrfs: check balance of send_in_progress
Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:04 -08:00
Wang Shilong 41ce9970a8 Btrfs: remove transaction from btrfs send
Since daivd did the work that force us to use readonly snapshot,
we can safely remove transaction protection from btrfs send.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:03 -08:00
Miao Xie 536cd96401 Btrfs: fix double initialization of the raid kobject
We met the following oops when doing space balance:
 kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
 ...
 Call Trace:
  [<ffffffff81937262>] dump_stack+0x49/0x5f
  [<ffffffff8137d259>] kobject_init+0x89/0xa0
  [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
  [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
  [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
  [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
  [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
  [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
  [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
  [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
  [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
  [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
  ...

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o nospace_cache <dev> <mnt>
 # btrfs balance start <mnt>
 # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1

The reason of this problem is that we initialized the raid kobject when we added
a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
the raid list was empty, we would initialize the raid kobject when we added the first
block group. But if there was not data stored in the block group, the block group
would be freed when doing balance, and the raid list would be empty. And then if we
allocated a new block group and added it into the raid list, we would initialize
the raid kobject again, the oops happened.

Fix this problem by initializing the raid kobject just when mounting the fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:03 -08:00
Wang Shilong 180589efde Btrfs: fix a warning when iput a file
See the warning below:

[ 1209.102076]  [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs]
[ 1209.102084]  [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs]
[ 1209.102089]  [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40
[ 1209.102092]  [<ffffffff8118ab2e>] evict+0x9e/0x190
[ 1209.102094]  [<ffffffff8118b313>] iput+0xf3/0x180
[ 1209.102101]  [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs]
[ 1209.102107]  [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs]

clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:02 -08:00
David Sterba 2c68653787 btrfs: Check read-only status of roots during send
All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.

Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.

We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.

Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:01 -08:00
David Sterba a8d89f5ba0 btrfs: remove unused mnt from send_ctx
Unused since ed2590953b
"Btrfs: stop using vfs_read in send".

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:01 -08:00
David Sterba 95bc79d50d btrfs: send: clean up dead code
Remove ifdefed code:

- tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
- tlv_put_timespec - the btrfs_timespec fields are used
- fs_path_remove obsoleted long ago

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:00 -08:00
Filipe David Borba Manana 3fe81ce206 Btrfs: fix deadlock when iterating inode refs and running delayed inodes
While running btrfs/004 from xfstests, after 503 iterations, dmesg reported
a deadlock between tasks iterating inode refs and tasks running delayed inodes
(during a transaction commit).

It turns out that iterating inode refs implies doing one tree search and
release all nodes in the path except the leaf node, and then passing that
leaf node to btrfs_ref_to_path(), which in turn does another tree search
without releasing the lock on the leaf node it received as parameter.

This is a problem when other task wants to write to the btree as well and
ends up updating the leaf that is read locked - the writer task locks the
parent of the leaf and then blocks waiting for the leaf's lock to be
released - at the same time, the task executing btrfs_ref_to_path()
does a second tree search, without releasing the lock on the first leaf,
and wants to access a leaf (the same or another one) that is a child of
the same parent, resulting in a deadlock.

The trace reported by lockdep follows.

[84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds.
[84314.936381]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936386] fsstress        D ffff8806e1bf8000     0 11930  11926 0x00000000
[84314.936393]  ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd
[84314.936399]  ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8
[84314.936405]  ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190
[84314.936410] Call Trace:
[84314.936421]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936428]  [<ffffffff81774269>] schedule+0x29/0x70
[84314.936451]  [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs]
[84314.936457]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936470]  [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs]
[84314.936489]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936504]  [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[84314.936510]  [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0
[84314.936528]  [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs]
[84314.936543]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936558]  [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936573]  [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs]
[84314.936589]  [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs]
[84314.936604]  [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs]
[84314.936620]  [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs]
[84314.936630]  [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs]
[84314.936635]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936639]  [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936643]  [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30
[84314.936648]  [<ffffffff811a3541>] iterate_supers+0xf1/0x100
[84314.936652]  [<ffffffff811d0c45>] sys_sync+0x55/0x90
[84314.936658]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936660] INFO: lockdep is turned off.
[84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds.
[84314.936666]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936670] btrfs           D ffff880541729a88     0 11955  11608 0x00000000
[84314.936674]  ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd
[84314.936680]  ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8
[84314.936685]  ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000
[84314.936690] Call Trace:
[84314.936695]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936700]  [<ffffffff81774269>] schedule+0x29/0x70
[84314.936717]  [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs]
[84314.936721]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936733]  [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs]
[84314.936746]  [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs]
[84314.936763]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936780]  [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs]
[84314.936797]  [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs]
[84314.936813]  [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936830]  [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs]
[84314.936846]  [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs]
[84314.936851]  [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200
[84314.936868]  [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs]
[84314.936873]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[84314.936877]  [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00
[84314.936882]  [<ffffffff81075563>] ? up_read+0x23/0x40
[84314.936886]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[84314.936892]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[84314.936896]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[84314.936901]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[84314.936906]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[84314.936910]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[84314.936915]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[84314.936920]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936922] INFO: lockdep is turned off.
[84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds.
[84434.866881]       Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84434.866886] btrfs-transacti D ffff880755b6a478     0 11921      2 0x00000000
[84434.866893]  ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd
[84434.866899]  ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8
[84434.866904]  ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0
[84434.866910] Call Trace:
[84434.866920]  [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84434.866927]  [<ffffffff81774269>] schedule+0x29/0x70
[84434.866948]  [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
[84434.866954]  [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84434.866970]  [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs]
[84434.866985]  [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs]
[84434.866999]  [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
[84434.867012]  [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs]
[84434.867026]  [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs]
[84434.867030]  [<ffffffff81070dad>] kthread+0xed/0x100
[84434.867035]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
[84434.867040]  [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0
[84434.867044]  [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:59 -08:00
Wang Shilong 663df05330 Btrfs: remove dead comments for read_csums()
Chris introduced hleper function  read_csums() and this function
has been removed, but we forgot to remove its corresponding comments.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:59 -08:00
Filipe David Borba Manana e223cfcd3e Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct
It's not used anywhere, so just drop it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:58 -08:00
Filipe David Borba Manana fc28b62d64 Btrfs: fix use of uninitialized err variable
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:57 -08:00
Wang Shilong 54eb72c05f Btrfs: remove unnecessary filemap writting and waiting after block group relocation
We have commited transaction before, remove redundant filemap writting and
waiting here, it can speed up balance relocation process.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:57 -08:00
Tsutomu Itoh 5662344b3c Btrfs: fix error check of btrfs_lookup_dentry()
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.

Callers who use btrfs_lookup_dentry() require a trivial update.

create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Filipe David Borba Manana 7835776635 Btrfs: return immediately if tree log mod is not necessary
In ctree.c:tree_mod_log_set_node_key() we were calling
__tree_mod_log_insert_key() even when the modification doesn't need
to be logged. This would allocate a tree_mod_elem structure, fill it
and pass it to  __tree_mod_log_insert(), which would just acquire
the tree mod log write lock and then free the tree_mod_elem structure
and return (that is, a no-op).

Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
which just returns immediately if the modification doesn't need to be
logged (without allocating the structure, fill it, acquire write lock,
free structure).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Josef Bacik f28491e0a6 Btrfs: move the extent buffer radix tree into the fs_info
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode.  The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:55 -08:00
Josef Bacik 34b41acec1 Btrfs: use a bit to track if we're in the radix tree
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers.  With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal.  This will give me a way to do that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:54 -08:00
Josef Bacik a5dee37d39 Btrfs: deal with io_tree->mapping being NULL
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:54 -08:00
Filipe David Borba Manana 2ef1fed285 Btrfs: more efficient push_leaf_right
Currently when finding the leaf to insert a key into a btree, if the
leaf doesn't have enough space to store the item we attempt to move
off some items from our leaf to its right neighbor leaf, and if this
fails to create enough free space in our leaf, we try to move off more
items to the left neighbor leaf as well.

When trying to move off items to the right neighbor leaf, if it has
enough room to store the new key but not not enough room to move off
at least one item from our target leaf, __push_leaf_right returns 1 and
we have to attempt to move items to the left neighbor (push_leaf_left
function) without touching the right neighbor leaf.
For the case where the right leaf has enough room to store at least 1
item from our leaf, we end up modifying (and dirtying) both our leaf
and the right leaf. This is non-optimal for the case where the new key
is greater than any key in our target leaf because it can be inserted at
slot 0 of the right neighbor leaf and we don't need to touch our leaf
at all nor to attempt to move off items to the left neighbor leaf.

Therefore this change just selects the right neighbor leaf as our new
target leaf if it has enough room for the new key without modifying our
initial target leaf - we do this only if the new key is higher than any
key in the initial target leaf.

While running the following test, push_leaf_right was called by split_leaf
4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
special case (right leaf has enough room and new key is higher than any key
in the initial target leaf).

Test:

  sysbench --test=fileio --file-num=512 --file-total-size=5G \
    --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Results:

sequential writes

Throughput before this change: 65.71Mb/sec (average of 10 runs)
Throughput after this change:  66.58Mb/sec (average of 10 runs)

random writes

Throughput before this change: 10.75Mb/sec (average of 10 runs)
Throughput after this change:  11.56Mb/sec (average of 10 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:53 -08:00
Wang Shilong cb7ab02156 Btrfs: wrap repeated code into scrub_blocked_if_needed()
Just wrap same code into one function scrub_blocked_if_needed().

This make a change that we will move waiting (@workers_pending = 0)
before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
we must take carefully to not deadlock here.

Thread 1			Thread 2
				|->btrfs_commit_transaction()
					|->set trans type(COMMIT_DOING)
					|->btrfs_scrub_paused()(blocked)
|->join_transaction(blocked)

Move btrfs_scrub_paused() before setting trans type which means we can
still join a transaction when commiting_transaction is blocked.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:53 -08:00
Wang Shilong 3cb0929ad2 Btrfs: fix wrong super generation mismatch when scrubbing supers
We came a race condition when scrubbing superblocks, the story is:

In commiting transaction, we will update @last_trans_commited after
writting superblocks, if scrubber start after writting superblocks
and before updating @last_trans_commited, generation mismatch happens!

We fix this by checking @scrub_pause_req, and we won't start a srubber
until commiting transaction is finished.(after btrfs_scrub_continue()
finished.)

Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:52 -08:00
Filipe David Borba Manana 5a0f4e2c2b Btrfs: fix pass of transid with wrong endianness in send.c
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2190:9:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2190:9:    got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2195:17:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2195:17:    got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:3716:9:    expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:3716:9:    got restricted __le64 [usertype] ctransid

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:51 -08:00
Filipe David Borba Manana d527afe1e5 Btrfs: fix extent_map block_len after merging
When merging an extent_map with its right neighbor, increment
its block_len with the neighbor's block_len.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:51 -08:00
Michal Nazarewicz 11850392ee btrfs: remove dead code
[commit 8185554d: fix incorrect inode acl reset] introduced a dead
code by adding a condition which can never be true to an else
branch.  The condition can never be true because it is already
checked by a previous if statement which causes function to return.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:50 -08:00
Filipe David Borba Manana 878f2d2cb3 Btrfs: fix max dir item size calculation
We were accounting for sizeof(struct btrfs_item) twice, once
in the data_size variable and another time in the if statement
below.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:49 -08:00
Filipe David Borba Manana 12cfbad90e Btrfs: more efficient extent state insertions
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.

This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:

Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles:  90th: 20985.000; 95th: 21155.000; 99th: 21369.000
  51.000 -   93.933:   693 ########
  93.933 -  172.314:   938 ##########
 172.314 -  315.408:   856 #########
 315.408 -  576.646:    95 #
 576.646 - 6415.830:   888 ##########
6415.830 - 11713.809:  1024 ###########
11713.809 - 21386.000:  4816 #####################################################

So traversing such trees can take some significant time that can
easily be avoided.

Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:

  sysbench --test=fileio --file-num=1 --file-total-size=2G \
    --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
    --max-requests=0 --max-time=60 --file-io-mode=sync

  sysbench --test=fileio --file-num=1 --file-total-size=2G \
    --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
    --max-requests=0 --max-time=60 --file-io-mode=sync

Before this change:

sequential writes: 69.28Mb/sec (average of 5 runs)
random writes:     4.14Mb/sec  (average of 5 runs)

After this change:

sequential writes: 69.91Mb/sec (average of 5 runs)
random writes:     5.69Mb/sec  (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:49 -08:00
Filipe David Borba Manana c42ac0bc95 Btrfs: add missing extent state caching calls
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:48 -08:00
Filipe David Borba Manana 32193c147f Btrfs: faster and more efficient extent map insertion
Before this change, adding an extent map to the extent map tree of an
inode required 2 tree nevigations:

1) doing a tree navigation to search for an existing extent map starting
   at the same offset or an extent map that overlaps the extent map we
   want to insert;

2) Another tree navigation to add the extent map to the tree (if the
   former tree search didn't found anything).

This change just merges these 2 steps into a single one.
While running first few btrfs xfstests I had noticed these trees easily
had a few hundred elements, and then with the following sysbench test it
reached over 1100 elements very often.

Test:

  sysbench --test=fileio --file-num=32 --file-total-size=10G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=1000000 --file-io-mode=sync [prepare|run]

(fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
prepare phase)

Before this patch:

run 1 - 41.894Mb/sec
run 2 - 40.527Mb/sec
run 3 - 40.922Mb/sec
run 4 - 49.433Mb/sec
run 5 - 40.959Mb/sec

average - 42.75Mb/sec

After this patch:

run 1 - 48.036Mb/sec
run 2 - 50.21Mb/sec
run 3 - 50.929Mb/sec
run 4 - 46.881Mb/sec
run 5 - 53.192Mb/sec

average - 49.85Mb/sec

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:48 -08:00
Filipe David Borba Manana 68ba990f7d Btrfs: fix extent boundary check in bio_readpage_error
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:47 -08:00
Filipe David Borba Manana 5a4267ca20 Btrfs: try harder to avoid btree node splits
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.

After 5 runs of the following test, I got a smaller number of btree
node splits overall:

sysbench --test=fileio --file-num=512 --file-total-size=5G \
  --file-test-mode=seqwr --num-threads=512 \
   --file-block-size=8192 --max-requests=100000 --file-io-mode=sync

Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)

After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)

An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:46 -08:00
Filipe David Borba Manana 1b8e7e45e5 Btrfs: avoid unnecessary ordered extent cache resets
After an ordered extent completes, don't blindly reset the
inode's ordered tree last accessed ordered extent pointer.

While running the xfstests I noticed that about 29% of the
time the ordered extent to which tree->last pointed was not
the same as our just completed ordered extent. After that I
ran the following sysbench test (after a prepare phase) and
noticed that about 68% of the time tree->last pointed to
a different ordered extent too.

sysbench --test=fileio --file-num=32 --file-total-size=4G \
    --file-test-mode=rndwr --num-threads=512 \
    --file-block-size=32768 --max-time=60 --max-requests=0 run

Therefore reset tree->last on ordered extent removal only if
it pointed to the ordered extent we're removing from the tree.

Results from 4 runs of the following test before and after
applying this patch:

$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
  --file-test-mode=seqwr --num-threads=512 \
  --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
  --file-test-mode=seqwr --num-threads=512 \
  --file-block-size=32768 --max-time=60 --file-io-mode=sync run

Before this path:

run 1 - 64.049Mb/sec
run 2 - 63.455Mb/sec
run 3 - 64.656Mb/sec
run 4 - 63.833Mb/sec

After this patch:

run 1 - 66.149Mb/sec
run 2 - 68.459Mb/sec
run 3 - 66.338Mb/sec
run 4 - 66.176Mb/sec

With random writes (--file-test-mode=rndwr) I had huge fluctuations
on the results (+- 35% easily).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:46 -08:00
Jeff Mahoney e453d989e0 btrfs: fix leaks during sysfs teardown
Filipe noticed that we were leaking the features attribute group
after umount. His fix of just calling sysfs_remove_group() wasn't enough
since that removes just the supported features and not the unsupported
features.

This patch changes the unknown feature handling to add them individually
so we can skip the kmalloc and uses the same iteration to tear them down
later.

We also fix the error handling during mount so that we catch the
failing creation of the per-super kobject, and handle proper teardown
of a half-setup sysfs context.

Tested properly with kmemleak enabled this time.

Reported-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:45 -08:00
Jeff Mahoney 1b8e5df6d9 btrfs: fix static checker warnings
This patch fixes the following warnings:
fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:44 -08:00
Filipe David Borba Manana 131e404a2a Btrfs: fix very slow inode eviction and fs unmount
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).

A quick way to reproduce this issue:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m25.457s
user	0m0.000s
sys	0m0.092s
$ cd ..
$ time umount /mnt/btrfs

real	1m38.234s
user	0m0.000s
sys	1m25.760s

The same test on ext4 runs much faster:

$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4

real	0m3.626s
user	0m0.004s
sys	0m3.012s

After this patch, the unmount (inode evictions) is much faster:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
    --file-test-mode=seqwr --num-threads=128 \
    --file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'

real	0m26.774s
user	0m0.000s
sys	0m0.084s
$ cd ..
$ time umount /mnt/btrfs

real	0m1.811s
user	0m0.000s
sys	0m1.564s

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:44 -08:00
Wang Shilong 0647bf564f Btrfs: improve forever loop when doing balance relocation
We hit a forever loop when doing balance relocation,the reason
is that we firstly reserve 4M(node size is 16k).and within transaction
we will try to add extra reservation for snapshot roots,this will
return -EAGAIN if there has been a thread flushing space to reserve
space.We will do this again and again with filesystem becoming nearly
full.

If the above '-EAGAIN' case happens, we try to refill reservation more
outsize of transaction, and this will return eariler in enospc case,however,
this dosen't really hurt because it makes no sense doing balance relocation
with the filesystem nearly full.

Miao Xie helped a lot to track this issue, thanks.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:43 -08:00
Filipe David Borba Manana 6126e3caf7 Btrfs: fix ordered extent check in btrfs_punch_hole
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie 376cc685cb Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
 fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
 fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
 fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync prepare
 # cd -
 # umount <mnt>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync run
 # cd -
 # umount <mnt>

The reason of this problem is:
Task0				Task1
btrfs_direct_IO
  unlock(&inode->i_mutex)
				lock(&inode->i_mutex)
				reserve_space()
				prepare_pages()
				  lock_extent()
				  clear_extent()
				  unlock_extent()
  lock_extent()
  test_extent(uptodate)
    return false
				copy_data()
				set_delalloc_extent()
  extent need compress
    go back to buffered write
  clear_extent(DELALLOC | DIRTY)
  unlock_extent()

Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.

This patch fixes the above bug by unlocking the extent after the delalloc.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie b37392ea86 Btrfs: cleanup unnecessary parameter and variant of prepare_pages()
- the caller has gotten the inode object, needn't pass the file object.
  And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
  we also needn't pass the root object into prepare_pages().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba cc37bb0420 btrfs: replace BUG in can_modify_feature
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba 43d87fa231 btrfs: reserve no transaction units in btrfs_feature_attr_store
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:40 -08:00
Frank Holton 27a0dd61a5 Btrfs: make btrfs_debug match pr_debug handling related to DEBUG
The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Sergei Trofimovich 33b98f2271 btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Kelley Nielsen 3f870c2899 btrfs: expand btrfs_find_item() to include find_orphan_item functionality
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.

Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:37 -08:00
Kelley Nielsen 75ac2dd907 btrfs: expand btrfs_find_item() to include find_root_ref functionality
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.

Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Kelley Nielsen e33d5c3d6d btrfs: bootstrap generic btrfs_find_item interface
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.

This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Valentina Giusti a3df41ee37 btrfs: fix unused variables in qgroup.c
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti e94acd86d4 btrfs: replace path->slots[0] with otherwise unused variable 'slot'
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti ce3e7f1073 btrfs: remove unused variable from scrub_fixup_nodatasum
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:34 -08:00
Valentina Giusti f0265bb409 btrfs: remove unused variable from setup_cluster_no_bitmap
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti 50892bac3b btrfs: remove unused variables from extent_io.c
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti 4b447bfac6 btrfs: remove unused variable from find_free_extent
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:32 -08:00
Valentina Giusti 71db2a7751 btrfs: remove unused variables from disk-io.c
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Valentina Giusti 99e22f783b btrfs: remove unused variable from btrfs_new_inode
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Jeff Mahoney f8ba9c11f8 btrfs: publish fs label in sysfs
This adds a writeable attribute which describes the label.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:30 -08:00
Jeff Mahoney 29e5be240a btrfs: publish device membership in sysfs
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney 6ab0a2029c btrfs: publish allocation data in sysfs
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney 01e219e806 btrfs: add ioctl to export size of global metadata reservation
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.

This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney 3b02a68a63 btrfs: use feature attribute names to print better error messages
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney ba631941ef btrfs: add ability to change features via sysfs
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.

Since the the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.

If neither set or clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:27 -08:00
Jeff Mahoney 79da4fa4d9 btrfs: publish unknown feature bits in sysfs
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.

This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney 510d73600a btrfs: publish per-super features in sysfs
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney 5ac1d209f1 btrfs: publish per-super attributes in sysfs
This patch adds per-super attributes to sysfs.

It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney 079b72bca3 btrfs: publish supported featured in sysfs
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.

The files are module-wide and export which features the kernel supports.

The content, for now, is just "0\n".

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney 2eaa055fab btrfs: add ioctls to query/change feature bits online
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.

We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
  basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.

We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.

The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo 9e5ac13acb Btrfs: skip merge part for delayed data refs
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data refs related to
this disk offset but we may have millions refs.

And in the case of delayed data refs, we don't usually have too much
data refs to merge.

So it's safe to shut it down for data refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo c46effa601 Btrfs: introduce a head ref rbtree
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.

The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.

When we have a great number of delayed refs pending to process,
this'll cost time a lot.

Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:22 -08:00
Josef Bacik e20d6c5ba3 Btrfs: fix check-integrity to look at the referenced data properly
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression.  Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced.  This fixes check_int_data freaking out when running
btrfs/004.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Josef Bacik 16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made somethings very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Jeff Layton 4db72b40fd nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
If the setting of NFS_INO_INVALIDATING gets reordered to before the
clearing of NFS_INO_INVALID_DATA, then another task may hit a race
window where both appear to be clear, even though the inode's pages are
still in need of invalidation. Fix this by adding the appropriate memory
barriers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-28 14:48:18 -05:00
Linus Torvalds d891ea23d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "This is a big batch.  From Ilya we have:

   - rbd support for more than ~250 mapped devices (now uses same scheme
     that SCSI does for device major/minor numbering)
   - crush updates for new mapping behaviors (will be needed for coming
     erasure coding support, among other things)
   - preliminary support for tiered storage pools

  There is also a big series fixing a pile cephfs bugs with clustered
  MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
  fscache improvements from Li Wang, improved behavior when we get
  ENOSPC from Josh Durgin, some readv/writev improvements from
  Majianpeng, and the usual mix of small cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
  ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
  ceph: fix dout() compile warnings in ceph_filemap_fault()
  libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
  libceph: follow redirect replies from osds
  libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
  libceph: follow {read,write}_tier fields on osd request submission
  libceph: add ceph_pg_pool_by_id()
  libceph: CEPH_OSD_FLAG_* enum update
  libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
  libceph: introduce and start using oid abstraction
  libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
  libceph: move ceph_file_layout helpers to ceph_fs.h
  libceph: start using oloc abstraction
  libceph: dout() is missing a newline
  libceph: add ceph_kv{malloc,free}() and switch to them
  libceph: support CEPH_FEATURE_EXPORT_PEER
  ceph: add imported caps when handling cap export message
  ceph: add open export target session helper
  ceph: remove exported caps when handling cap import message
  ceph: handle session flush message
  ...
2014-01-28 11:02:23 -08:00
Linus Torvalds 08d21b5f93 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull exofs and ore fixes from Boaz Harrosh:
 "The main fix here, the first patch, is also destined for -stable.  The
  rest is small trivia and cosmetics.  The ORE patches effect both exofs
  and pnfs-objects very reproducible bugs"

[ ORE is "object raid engine", used by exofs and pnfs  - Linus ]

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Print less in r4w
  exofs: Allow corrupted directory entry to be empty file
  exofs: Allow O_DIRECT open
  ore: Don't crash on NULL bio in _clear_bio
  ore: Fix wrong math in allocation of per device BIO
2014-01-28 10:57:14 -08:00
Ilya Dryomov 125d725c92 ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
Use min_t(size_t, ...) instead of plain min(), which does strict type
checking, to avoid compile warning on i386.

Cc: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-28 09:57:21 -08:00
Ilya Dryomov 37b52fe608 ceph: fix dout() compile warnings in ceph_filemap_fault()
PAGE_CACHE_SIZE is unsigned long on all architectures, however size_t
is either unsigned int or unsigned long.  Rather than change format
strings, cast PAGE_CACHE_SIZE to size_t to be in line with dout()s in
ceph_page_mkwrite().

Cc: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-28 09:57:06 -08:00
Linus Torvalds a53b75b37a Bug fixes and cleanups for ext4. We also enable the punch hole
functionality for bigalloc file systems.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJS59knAAoJENNvdpvBGATwdPMQAOqi3O7DdDKt6wUkv6LnYSH5
 y5MpvsKlvAL9WmS9Uj/S9VnG0eIakLbdl1DcTrnsqgbCryHPlWC8jOsPn22yanlN
 T+QGfw+1CkPrionsQVzKH17sFImITrcvz9z0vaX9//imSRmhm8JHMgVaboKnMXK2
 pdYdwNUGPlG4sUiaZEgX2lD4q57MUYRmGBracTRA4rC7roxRrZRv4MpRe3vdqC9D
 RNkUfrv6Bv63JsMVYAMzanMCH0IB+96ZhoRAym82H1U7nyAMvDKFWASf8dHbOXDu
 64RQlB55Ehqv4wAJ2JBwkH1vi6NIF/LEysfmksctMRDB+rP6i6HSQMOQ3SsjDUWL
 Llq6J6raeyfHa0sCDh5A4hQhtOFMuhhbqHWFcfoUgGTI3aSnt7zGuh7/afIw++Go
 tLzqJFRzCgGdnmihLRbjIy9d6eUeEj2m40bg/QOg89Y3EhTsq0kwVf0TXAOskvx4
 kBWH4ZFemfUxfDMNIa9uPN99I4/FlBO/9xGneDb6vYtPB/fA3jnaPYu/LIorF6Sc
 Mw1wH4Qy3oX5j9VcJlvZPY5cGmBLNyQqEdbOBgi6zeEFF1g2kTMf1PQQ++9rcLlV
 Cf9qc4IrjAUGbwo59gQtuB8OzhNxkl4mhOTVNZ/PW8G2rpufdqGhndUsjTAPNmKM
 +tf+CjdJuHzPnj7Ig21u
 =ofDv
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Bug fixes and cleanups for ext4.  We also enable the punch hole
  functionality for bigalloc file systems"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: delete "set but not used" variables
  ext4: don't pass freed handle to ext4_walk_page_buffers
  ext4: avoid clearing beyond i_blocks when truncating an inline data file
  ext4: ext4_inode_is_fast_symlink should use EXT4_CLUSTER_SIZE
  ext4: fix a typo in extents.c
  ext4: use %pd printk specificer
  ext4: standardize error handling in ext4_da_write_inline_data_begin()
  ext4: retry allocation when inline->extent conversion failed
  ext4: enable punch hole for bigalloc
2014-01-28 08:54:16 -08:00
Linus Torvalds 2b2b15c32a NFS client updates for Linux 3.14
Highlights include:
 
 - Stable fix for an infinite loop in RPC state machine
 - Stable fix for a use after free situation in the NFSv4 trunking discovery
 - Stable fix for error handling in the NFSv4 trunking discovery
 - Stable fix for the page write update code
 - Stable fix for the NFSv4.1 mount time security negotiation
 - Stable fix for the NFSv4 open code.
 - O_DIRECT locking fixes
 - fix an Oops in the pnfs file commit code
 - RPC layer needs finer grained handling of connection errors
 - More RPC GSS upcall fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS5ozQAAoJEGcL54qWCgDy8EIQAMKYX1E5qOal3oJCzWdHAPNz
 ZSQ7CbA3c66vgJwpxy5Mz4gEtTK1IEzfTX31gLgkCXkyw54As+0lOa/SvoXFUusN
 BdBtskkIcVjhcly56xP2dzWGMsVrS8Vt+nwhsPv1Qaor5El0zXwPv8YE5PuuxJK5
 fyQdFEsywnCHtmFdyBdzsV8qHvAA0rxZTMmd6ZDBPCi9362D+pfp/1ESVOA6O14N
 rMBAbadF0pVM1UNvcvxSQaeqwCNqg5OuYKgyy9rhlH0WiQ6ijvKPrLVwg2pKZ2hj
 DCmwEqmKNEpxIFeOvmgFs/uhOEBx2IOF58xTc0+X81q96yTVm80anG1VTNFX577U
 gO8Ts0K/gWTD8ghxz4vh4/llc4yUv8ep8zB3qdSfL8C217UJIwnshkbPct7P1DTh
 8vpWtUeVJPu6rwcxMQXy0NntNZjRo1aqrv+htvFzPAMicM2KEAp73eOjStefvtr5
 JkdbvhhOR6dLwPrUEXM5FW5ewURegLjLcEqw3tq8kMnH0nEYjWOMBaB+uT0QFXun
 EXNqCpQHmHisem/3lGU+iVPc9lPf3C6tPIgjvoSplKcah1l3phVx6a5ReL22Zx2n
 qB2ePHfqToMjMcWiW3O3sbRpaDb+Br7xI4l8F3oeicvfv7SKB8k1u/w2IIoXKFIa
 FIdD6R0UIPgdnH5c03EC
 =abfY
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for an infinite loop in RPC state machine
   - stable fix for a use after free situation in the NFSv4 trunking discovery
   - stable fix for error handling in the NFSv4 trunking discovery
   - stable fix for the page write update code
   - stable fix for the NFSv4.1 mount time security negotiation
   - stable fix for the NFSv4 open code.
   - O_DIRECT locking fixes
   - fix an Oops in the pnfs file commit code
   - RPC layer needs finer grained handling of connection errors
   - more RPC GSS upcall fixes"

* tag 'nfs-for-3.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (30 commits)
  pnfs: Proper delay for NFS4ERR_RECALLCONFLICT in layout_get_done
  pnfs: fix BUG in filelayout_recover_commit_reqs
  nfs4: fix discover_server_trunking use after free
  NFSv4.1: Handle errors correctly in nfs41_walk_client_list
  nfs: always make sure page is up-to-date before extending a write to cover the entire page
  nfs: page cache invalidation for dio
  nfs: take i_mutex during direct I/O reads
  nfs: merge nfs_direct_write into nfs_file_direct_write
  nfs: merge nfs_direct_read into nfs_file_direct_read
  nfs: increment i_dio_count for reads, too
  nfs: defer inode_dio_done call until size update is done
  nfs: fix size updates for aio writes
  nfs4.1: properly handle ENOTSUP in SECINFO_NO_NAME
  NFSv4.1: Fix a race in nfs4_write_inode
  NFSv4.1: Don't trust attributes if a pNFS LAYOUTCOMMIT is outstanding
  point to the right include file in a comment (left over from a9004abc3)
  NFS: dprintk() should not print negative fileids and inode numbers
  nfs: fix dead code of ipv6_addr_scope
  sunrpc: Fix infinite loop in RPC state machine
  SUNRPC: Add tracepoint for socket errors
  ...
2014-01-28 08:46:44 -08:00
Linus Torvalds bf3d846b78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted stuff; the biggest pile here is Christoph's ACL series.  Plus
  assorted cleanups and fixes all over the place...

  There will be another pile later this week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
  __dentry_path() fixes
  vfs: Remove second variable named error in __dentry_path
  vfs: Is mounted should be testing mnt_ns for NULL or error.
  Fix race when checking i_size on direct i/o read
  hfsplus: remove can_set_xattr
  nfsd: use get_acl and ->set_acl
  fs: remove generic_acl
  nfs: use generic posix ACL infrastructure for v3 Posix ACLs
  gfs2: use generic posix ACL infrastructure
  jfs: use generic posix ACL infrastructure
  xfs: use generic posix ACL infrastructure
  reiserfs: use generic posix ACL infrastructure
  ocfs2: use generic posix ACL infrastructure
  jffs2: use generic posix ACL infrastructure
  hfsplus: use generic posix ACL infrastructure
  f2fs: use generic posix ACL infrastructure
  ext2/3/4: use generic posix ACL infrastructure
  btrfs: use generic posix ACL infrastructure
  fs: make posix_acl_create more useful
  fs: make posix_acl_chmod more useful
  ...
2014-01-28 08:38:04 -08:00
Trond Myklebust 17dfeb9113 NFS: Fix races in nfs_revalidate_mapping
Commit d529ef83c3 (NFS: fix the handling
of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping) introduces
a potential race, since it doesn't test the value of nfsi->cache_validity
and set the bitlock in nfsi->flags atomically.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Jeff Layton <jlayton@redhat.com>
2014-01-28 10:47:08 -05:00
Heiko Carstens 592f6b842f compat: fix sys_fanotify_mark
Commit 91c2e0bcae ("unify compat fanotify_mark(2), switch to
COMPAT_SYSCALL_DEFINE") added a new unified compat fanotify_mark syscall
to be used by all architectures.

Unfortunately the unified version merges the split mask parameter in a
wrong way: the lower and higher word got swapped.

This was discovered with glibc's tst-fanotify test case.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@vger.kernel.org>	[3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27 21:02:40 -08:00
Xiaowei.Hu f5b258550f ocfs2: do not log ENOENT in unlink()
Suppress log message like this: (open_delete,8328,0):ocfs2_unlink:951
ERROR: status = -2

Orabug:17445485

Signed-off-by: Xiaowei Hu <xiaowei.hu@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27 21:02:39 -08:00
Ming Chen ed47b062ce nfsd: consider CLAIM_FH when handing out delegation
CLAIM_FH was added by NFSv4.1.  It is the same as CLAIM_NULL except that it
uses only current FH to identify the file to be opened.

The NFS client is using CLAIM_FH if the FH is available when opening a file.
Currently, we cannot get any delegation if we stat a file before open it
because the server delegation code does not recognize CLAIM_FH.

We tested this patch and found delegation can be handed out now when claim is
CLAIM_FH.

See http://marc.info/?l=linux-nfs&m=136369847801388&w=2 and
http://www.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#New_open_claim_types

Signed-off-by: Ming Chen <mchen@cs.stonybrook.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-01-27 18:02:42 -05:00
Ilya Dryomov 7c13cb6435 libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename
it to ceph_oloc_oid_to_pg() to make its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2014-01-27 23:57:32 +02:00
Jeff Layton 0ea9de0ea6 sunrpc: turn warn_gssd() log message into a dprintk()
The original printk() made sense when the GSSAPI codepaths were called
only when sec=krb5* was explicitly requested. Now however, in many cases
the nfs client will try to acquire GSSAPI credentials by default, even
when it's not requested.

Since we don't have a great mechanism to distinguish between the two
cases, just turn the pr_warn into a dprintk instead. With this change we
can also get rid of the ratelimiting.

We do need to keep the EXPORT_SYMBOL(gssd_running) in place since
auth_gss.ko needs it and sunrpc.ko provides it. We can however,
eliminate the gssd_running call in the nfs code since that's a bit of a
layering violation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-27 15:36:05 -05:00
Jeff Layton d529ef83c3 NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
There is a possible race in how the nfs_invalidate_mapping function is
handled.  Currently, we go and invalidate the pages in the file and then
clear NFS_INO_INVALID_DATA.

The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.

So, we must clear the flag first and then invalidate the pages. Doing
this however, opens another race:

It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it trusts that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.

These effects are easily manifested by running diotest3 from the LTP
test suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache incorrectly. While mixing direct and buffered I/O isn't
recommended, I believe it's possible to hit this in other ways that just
use buffered I/O, though that situation is much harder to reproduce.

The problem is that the checking/clearing of that flag and the
invalidation of the mapping really need to be atomic. Fix this by
serializing concurrent invalidations with a bitlock.

At the same time, we also need to allow other places that check
NFS_INO_INVALID_DATA to check whether we might be in the middle of
invalidating the file, so fix up a couple of places that do that
to look for the new NFS_INO_INVALIDATING flag.

Doing this requires us to be careful not to set the bitlock
unnecessarily, so this code only does that if it believes it will
be doing an invalidation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-27 15:35:56 -05:00
J. Bruce Fields 4335723e8e nfsd4: fix delegation-unlink/rename race
If a file is unlinked or renamed between the time when we do the local
open and the time when we get the delegation, then we will return to the
client indicating that it holds a delegation even though the file no
longer exists under the name it was open under.

But a client performing an open-by-name, when it is returned a
delegation, must be able to assume that the file is still linked at the
name it was opened under.

So, hold the parent i_mutex for longer to prevent concurrent renames or
unlinks.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-01-27 13:59:16 -05:00
J. Bruce Fields c0e6bee480 nfsd4: delay setting current_fh in open
This is basically a no-op, to simplify a following patch.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-01-27 13:59:16 -05:00
J. Bruce Fields e873088f29 nfsd4: minor nfs4_setlease cleanup
As far as I can tell, this list is used only under the state lock, so we
may as well do this in the simpler order.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-01-27 13:59:15 -05:00
Malahal Naineni 7dd7d95916 nfs: handle servers that support only ALLOW ACE type.
Currently we support ACLs if the NFS server file system supports both
ALLOW and DENY ACE types. This patch makes the Linux client work with
ACLs even if the server supports only 'ALLOW' ACE type.

Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-27 11:54:35 -05:00
Steve French 666753c3ef [CIFS] Fix SMB2 mounts so they don't try to set or get xattrs via cifs
When mounting with smb2 (or smb2.1 or smb3) we need to check to make
sure that attempts to query or set extended attributes do not
attempt to send the request with the older cifs protocol instead
(eventually we also need to add the support in SMB2
to query/set extended attributes but this patch prevents us from
using the wrong protocol for extended attribute operations).

Signed-off-by: Steve French <smfrench@gmail.com>
2014-01-26 23:53:43 -06:00
Linus Torvalds 77d143de75 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML changes from Richard Weinberger:
 "This time only various cleanups and housekeeping patches"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: hostfs: make functions static
  um: Include generic barrier.h
  um: Removed unused attributes from thread_struct
2014-01-26 11:06:16 -08:00
Linus Torvalds 1c2948380b 9p changes for 3.14 merge window
Included are a new cache model for support of mmap,
 and several cleanups across the filesystem and networking
 portions of the code.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJS4qT3AAoJEDZk62b0Tg6xiF0P/RI2c75f/5SDbeNtH0usbgRj
 YfC7do4NX0HX0nI8H0bvfai1UU9JLc8M7aEjf5nw19O45phOQcGm/KeGRmMKtAhr
 OixTLXaMkRd3llTGkFv8ZY0W6aaSsGpB3Lzin+lZBwzYqMcqksBqhOTOoS1MSh3F
 u5PyhpuyJmxMkS5ud857PfwIREXOSHF/NIMMs5k9M9kK0zCka7xvl4Kg8zng2RVf
 A3rmKsLEvYuAgnxOq16hsRMgqHwx2833C3VmQKSl/n6SfOCy7cAMNmChIDrnAwtF
 dJosxypiRSkYjgD/YJR3UZofF7IqPgdL4umNmnb2lTHbOpeqNQ1hLB8BotjGpoVX
 pl9lxzz8UzaflwkAdgPsy/GBrbULxQKPLhL1Y0QPedhYh57bqRUEPPJ/HOjyrbOE
 RZXKZXfKbYlbNwc61N+meRC0IJETTjafnJlEzXu2vA+3LxZ3n/uZ7uq7XasVPiUV
 UKTKcvzYMs/PxA47rX81DOzebmphGEZDzw2ONbi4LMwGqeWt6WIpCMLPdGDjq7kl
 jdkpf9DuDr4mDrVP5+cFhzGQYbv9rCGR1zakWSW2H9xqP4Zy+o3kEPstniTMuNS4
 smkLPfpcG0VAKvY3HiVxT62EA4M+38IBAME0ATicE6esrWDyuLtGlke7x+uZoLUF
 mQ7WPimYBR+60liZ3zbQ
 =tCej
 -----END PGP SIGNATURE-----

Merge tag 'for-3.14-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p changes from Eric Van Hensbergen:
 "Included are a new cache model for support of mmap, and several
  cleanups across the filesystem and networking portions of the code"

* tag 'for-3.14-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: update documentation
  9P: introduction of a new cache=mmap model.
  net/9p: remove virtio default hack and set appropriate bits instead
  9p: remove useless 'name' variable and assignment
  9p: fix return value in case in v9fs_fid_xattr_set()
  9p: remove useless variable and assignment
  9p: remove useless assignment
  9p: remove unused 'super_block' struct pointer
  9p: remove never used return variable
  9p: remove unused 'p9_fid' struct pointer
  9p: remove unused 'p9_client' struct pointer
2014-01-26 10:55:41 -08:00
Al Viro f650080152 __dentry_path() fixes
* we need to save the starting point for restarts
* reject pathologically short buffers outright

Spotted-by: Denys Vlasenko <dvlasenk@redhat.com>
Spotted-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 12:37:55 -05:00
Eric W. Biederman a8323da036 vfs: Remove second variable named error in __dentry_path
In commit  232d2d60aa
Author: Waiman Long <Waiman.Long@hp.com>
Date:   Mon Sep 9 12:18:13 2013 -0400

    dcache: Translating dentry into pathname without taking rename_lock

The __dentry_path locking was changed and the variable error was
intended to be moved outside of the loop.  Unfortunately the inner
declaration of error was not removed. Resulting in a version of
__dentry_path that will never return an error.

Remove the problematic inner declaration of error and allow
__dentry_path to return errors once again.

Cc: stable@vger.kernel.org
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:43 -05:00
Eric W. Biederman 260a459d2e vfs: Is mounted should be testing mnt_ns for NULL or error.
A bug was introduced with the is_mounted helper function in
commit f7a99c5b7c
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sat Jun 9 00:59:08 2012 -0400

    get rid of ->mnt_longterm

    it's enough to set ->mnt_ns of internal vfsmounts to something
    distinct from all struct mnt_namespace out there; then we can
    just use the check for ->mnt_ns != NULL in the fast path of
    mntput_no_expire()

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

The intent was to test if the real_mount(vfsmount)->mnt_ns was
NULL_OR_ERR but the code is actually testing real_mount(vfsmount)
and always returning true.

The result is d_absolute_path returning paths it should be hiding.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:42 -05:00
Steven Whitehouse 9fe55eea7e Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?

Steve.

---------------------

Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.

Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:

 a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue

 b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.

 c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.

The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.

There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:42 -05:00
Christoph Hellwig 2796e4cec5 hfsplus: remove can_set_xattr
When using the per-superblock xattr handlers permission checking is
done by the generic code.  hfsplus just needs to check for the magic
osx attribute not to leak into protected namespaces.

Also given that the code was obviously copied from JFS the proper
attribution was missing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:41 -05:00
Christoph Hellwig 4ac7249ea5 nfsd: use get_acl and ->set_acl
Remove the boilerplate code to marshall and unmarhall ACL objects into
xattrs and operate on the posix_acl objects directly.  Also move all
the ACL handling code into nfs?acl.c where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:41 -05:00
Christoph Hellwig feda821e76 fs: remove generic_acl
And instead convert tmpfs to use the new generic ACL code, with two stub
methods provided for in-memory filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:40 -05:00
Christoph Hellwig 013cdf1088 nfs: use generic posix ACL infrastructure for v3 Posix ACLs
This causes a small behaviour change in that we don't bother to set
ACLs on file creation if the mode bit can express the access permissions
fully, and thus behaving identical to local filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26 08:26:20 -05:00
James Hogan 9e443bc369 um: hostfs: make functions static
The hostfs_*() callback functions are all only used within
hostfs_kern.c, so make them static.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: user-mode-linux-devel@lists.sourceforge.net
Signed-off-by: Richard Weinberger <richard@nod.at>
2014-01-26 11:51:09 +01:00
Christoph Hellwig e01580bf9e gfs2: use generic posix ACL infrastructure
This contains some major refactoring for the create path so that
inodes are created with the right mode to start with instead of
fixing it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:22 -05:00
Christoph Hellwig 2cc6a5a01c jfs: use generic posix ACL infrastructure
Copy the scheme I introduced to btrfs many years ago to only use the
xattr handler for ACLs, but pass plain attrs straight through.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:22 -05:00
Christoph Hellwig 2401dc2975 xfs: use generic posix ACL infrastructure
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux,
and create inodes with the proper mode instead of fixing it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:21 -05:00
Christoph Hellwig 47f70d08fa reiserfs: use generic posix ACL infrastructure
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:21 -05:00
Christoph Hellwig 702e5bc68a ocfs2: use generic posix ACL infrastructure
This contains some major refactoring for the create path so that
inodes are created with the right mode to start with instead of
fixing it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:21 -05:00
Christoph Hellwig f2963d4551 jffs2: use generic posix ACL infrastructure
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:20 -05:00
Christoph Hellwig b0a7ab5706 hfsplus: use generic posix ACL infrastructure
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:20 -05:00
Christoph Hellwig a6dda0e63e f2fs: use generic posix ACL infrastructure
f2fs has some weird mode bit handling, so still using the old
chmod code for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:19 -05:00
Christoph Hellwig 64e178a711 ext2/3/4: use generic posix ACL infrastructure
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:19 -05:00
Christoph Hellwig 996a710d46 btrfs: use generic posix ACL infrastructure
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig 37bc15392a fs: make posix_acl_create more useful
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig 5bf3258fd2 fs: make posix_acl_chmod more useful
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig 2aeccbe957 fs: add generic xattr_acl handlers
With the ->set_acl inode operation we can implement the Posix ACL
xattr handlers in generic code instead of duplicating them all
over the tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:17 -05:00
Christoph Hellwig 2982baa2ae fs: add get_acl helper
Factor out the code to get an ACL either from the inode or disk from
check_acl, so that it can be used elsewhere later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:16 -05:00
Christoph Hellwig 5c8ebd57b6 fs: merge xattr_acl.c into posix_acl.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:16 -05:00
Christoph Hellwig 9dad943ae7 reiserfs: prefix ACL symbols with reiserfs_
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:15 -05:00
Linus Torvalds 4ba9920e5e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BPF debugger and asm tool by Daniel Borkmann.

 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

 3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

 4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match.  From Ben Hutchings.

 5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

 6) Implement auto corking in TCP, from Eric Dumazet.  Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

 9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

10) Fix ipv6 router reachability probing, from Jiri Benc.

11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

12) Support tunneling in GRO layer, from Jerry Chu.

13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI.  From Atzm Watanabe.

15) New "Heavy Hitter" qdisc, from Terry Lam.

16) Significantly improve the IPSEC support in pktgen, from Fan Du.

17) Allow ipv4 tunnels to cache routes, just like sockets.  From Tom
    Herbert.

18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

19) Allow openvswitch to mmap'd netlink, from Thomas Graf.

20) Key TCP metrics blobs also by source address, not just destination
    address.  From Christoph Paasch.

21) Support 10G in generic phylib.  From Andy Fleming.

22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided.  From Tom Herbert.

The wireless and netfilter folks have been busy little bees too.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
  net/cxgb4: Fix referencing freed adapter
  ipv6: reallocate addrconf router for ipv6 address when lo device up
  fib_frontend: fix possible NULL pointer dereference
  rtnetlink: remove IFLA_BOND_SLAVE definition
  rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
  qlcnic: update version to 5.3.55
  qlcnic: Enhance logic to calculate msix vectors.
  qlcnic: Refactor interrupt coalescing code for all adapters.
  qlcnic: Update poll controller code path
  qlcnic: Interrupt code cleanup
  qlcnic: Enhance Tx timeout debugging.
  qlcnic: Use bool for rx_mac_learn.
  bonding: fix u64 division
  rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
  sfc: Use the correct maximum TX DMA ring size for SFC9100
  Add Shradha Shah as the sfc driver maintainer.
  net/vxlan: Share RX skb de-marking and checksum checks with ovs
  tulip: cleanup by using ARRAY_SIZE()
  ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
  net/cxgb4: Don't retrieve stats during recovery
  ...
2014-01-25 11:17:34 -08:00
Rakesh Pandit ac34a1b35e befs: iget_locked() doesn't return an ERR_PTR
Also fix befs_iget return value if iget_locked fails.

Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:38 -05:00
Oleg Nesterov e6ff9a9fa4 fs: __fget_light() can use __fget() in slow path
The slow path in __fget_light() can use __fget() to avoid the
code duplication. Saves 232 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:38 -05:00
Oleg Nesterov ad46183445 fs: factor out common code in fget_light() and fget_raw_light()
Apart from FMODE_PATH check fget_light() and fget_raw_light() are
identical, shift the code into the new helper, __fget_light(fd, mask).
Saves 208 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov 1deb46e256 fs: factor out common code in fget() and fget_raw()
Apart from FMODE_PATH check fget() and fget_raw() are identical,
shift the code into the new simple helper, __fget(fd, mask). Saves
160 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov ce08b62d18 change close_files() to use rcu_dereference_raw(files->fdt)
put_files_struct() and close_files() do rcu_read_lock() to make
rcu_dereference_check_fdtable() happy.

This looks a bit ugly, files_fdtable() just reads the pointer,
we can simply use rcu_dereference_raw() to avoid the warning.

The patch also changes close_files() to return fdt, this avoids
another rcu_read_lock()/files_fdtable() in put_files_struct().

I think close_files() needs more cleanups:

	- we do not need xchg() exactly because we are the last
	  user of this files_struct

	- "if (file)" should be turned into WARN_ON(!file)

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov a8d4b8345e introduce __fcheck_files() to fix rcu_dereference_check_fdtable(), kill rcu_my_thread_group_empty()
rcu_dereference_check_fdtable() looks very wrong,

1. rcu_my_thread_group_empty() was added by 844b9a8707 "vfs: fix
   RCU-lockdep false positive due to /proc" but it doesn't really
   fix the problem. A CLONE_THREAD (without CLONE_FILES) task can
   hit the same race with get_files_struct().

   And otoh rcu_my_thread_group_empty() can suppress the correct
   warning if the caller is the CLONE_FILES (without CLONE_THREAD)
   task.

2. files->count == 1 check is not really right too. Even if this
   files_struct is not shared it is not safe to access it lockless
   unless the caller is the owner.

   Otoh, this check is sub-optimal. files->count == 0 always means
   it is safe to use it lockless even if files != current->files,
   but put_files_struct() has to take rcu_read_lock(). See the next
   patch.

This patch removes the buggy checks and turns fcheck_files() into
__fcheck_files() which uses rcu_dereference_raw(), the "unshared"
callers, fget_light() and fget_raw_light(), can use it to avoid
the warning from RCU-lockdep.

fcheck_files() is trivially reimplemented as rcu_lockdep_assert()
plus __fcheck_files().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:36 -05:00
Al Viro 2ccdc41319 kill reiserfs_bdevname()
it's never called with NULL argument...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:06 -05:00