Commit Graph

37562 Commits

Author SHA1 Message Date
Weston Andros Adamson bfd484a560 nfs: use blocking page_group_lock in add_request
__nfs_pageio_add_request was calling nfs_page_group_lock nonblocking, but
this can return -EAGAIN which would end up passing -EIO to the application.

There is no reason not to block in this path, so change the two calls to
do so. Also, there is no need to check the return value of
nfs_page_group_lock when nonblock=false, so remove the error handling code.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:43 -04:00
Weston Andros Adamson bc8a309e88 nfs: fix nonblocking calls to nfs_page_group_lock
nfs_page_group_lock was calling wait_on_bit_lock even when told not to
block. Fix by first trying test_and_set_bit, followed by wait_on_bit_lock
if and only if blocking is allowed.  Return -EAGAIN if nonblocking and the
test_and_set of the bit was already locked.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:42 -04:00
Weston Andros Adamson fd2f3a06d3 nfs: change nfs_page_group_lock argument
Flip the meaning of the second argument from 'wait' to 'nonblock' to
match related functions. Update all five calls to reflect this change.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:42 -04:00
Chao Yu b5b822050c f2fs: use macro for code readability
This patch introduces DEF_NIDS_PER_INODE/GET_ORPHAN_BLOCKS/F2FS_CP_PACKS macro
instead of numbers in code for readability.

change log from v1:
 o fix typo pointed out by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-22 13:56:47 -07:00
Jeff Layton e0b760ff71 locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
The argument to locks_unlink_lock can't be just any pointer to a
pointer. It must be a pointer to the fl_next field in the previous
lock in the list.

Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-08-22 09:58:22 -04:00
Chao Yu 9d1589ef2e f2fs: introduce need_do_checkpoint for readability
This patch introduce need_do_checkpoint() to include numerous judgment condition
for readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:07 -07:00
Chao Yu c200b1aa6c f2fs: fix incorrect calculation with total/free inode num
Theoretically, our total inodes number is the same as total node number, but
there are three node ids are reserved in f2fs, they are 0, 1 (node nid), and 2
(meta nid), and they should never be used by user, so our total/free inode
number calculated in ->statfs is wrong.

This patch indroduces F2FS_RESERVED_NODE_NUM and then fixes this issue by
recalculating total/free inode number with the macro.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:06 -07:00
Jaegeuk Kim 04859dba50 f2fs: remove rename and use rename2
Refer the following patch.

commit 7177a9c4b5
Author: Miklos Szeredi <mszeredi@suse.cz>
Date:   Wed Jul 23 15:15:30 2014 +0200

    fs: call rename2 if exists

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:04 -07:00
Jaegeuk Kim ec4e7af4ca f2fs: skip if inline_data was converted already
This patch checks inline_data one more time under the inode page lock whether
its inline_data is converted or not.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:03 -07:00
Jaegeuk Kim 202095a7a0 f2fs: remove rewrite_node_page
I think we need to let the dirty node pages remain in the page cache instead
of rewriting them in their places.
So, after done with successful recovery, write_checkpoint will flush all of them
through the normal write path.
Through this, we can avoid potential error cases in terms of block allocation.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:02 -07:00
Jaegeuk Kim 764aa3e978 f2fs: avoid double lock in truncate_blocks
The init_inode_metadata calls truncate_blocks when error is occurred.
The callers holds f2fs_lock_op, so we should not call it again in
truncate_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:01 -07:00
Jaegeuk Kim 14f4e69085 f2fs: prevent checkpoint during roll-forward
Any checkpoint should not be done during the core roll-forward procedure.
Especially, it includes error cases too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:00 -07:00
Jaegeuk Kim b3fe0a0da2 f2fs: add WARN_ON in f2fs_bug_on
This patch adds WARN_ON when f2fs_bug_on is disable to see kernel messages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:56:59 -07:00
Jaegeuk Kim cf779cab14 f2fs: handle EIO not to break fs consistency
There are two rules when EIO is occurred.
1. don't write any checkpoint data to preserve the previous checkpoint
2. don't lose the cached dentry/node/meta pages

So, at first, this patch adds set_page_dirty in f2fs_write_end_io's failure.
Then, writing checkpoint/dentry/node blocks is not allowed.

Note that, for the data pages, we can't just throw away by redirtying them.
Otherwise, kworker can fall into infinite loop to flush them.
(Ref. xfstests/019)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:55:05 -07:00
Jaegeuk Kim 8501017e50 f2fs: check s_dirty under cp_mutex
It needs to check s_dirty under cp_mutex, since s_dirty is reset under that
mutex.
And previous condition was not correct, since we can omit doing checkpoint
when checkpoint was done followed by all the node pages were written back.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:02 -07:00
Jaegeuk Kim 5274651927 f2fs: unlock_page when node page is redirtied out
This patch fixes missing unlock_page when a node page is redirtied out.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:01 -07:00
Jaegeuk Kim 1e968fdfe6 f2fs: introduce f2fs_cp_error for readability
This patch adds f2fs_cp_error for readability.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:00 -07:00
Jaegeuk Kim ed2e621a95 f2fs: give a chance to mount again when encountering errors
This patch gives another chance to try mount process when we encounter an error.
This makes an effect on the roll-forward recovery failures as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:00 -07:00
Jaegeuk Kim 6f12ac25f0 f2fs: trigger release_dirty_inode in f2fs_put_super
The generic_shutdown_super calls sync_filesystem, evict_inode, and then
f2fs_put_super. In f2fs_evict_inode, we remain some dirty inode information
so we should release them at f2fs_put_super.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:20:29 -07:00
Chris Mason f6dc45c7a9 Btrfs: fix filemap_flush call in btrfs_file_release
We should only be flushing on close if the file was flagged as needing
it during truncate.  I broke this with my ordered data vs transaction
commit deadlock fix.

Thanks to Miao Xie for catching this.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2014-08-21 07:55:31 -07:00
Liu Bo 38c1c2e44b Btrfs: fix crash on endio of reading corrupted block
The crash is

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

This is in fact a regression.

It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:30 -07:00
Eric Sandeen a3c108950d btrfs: fix leak in qgroup_subtree_accounting() error path
Coverity pointed this out; in the newly added
qgroup_subtree_accounting(), if btrfs_find_all_roots()
returns an error, we leak at least the parents pointer,
and possibly the roots pointer, depending on what failure
occurs.

If btrfs_find_all_roots() returns an error, we need to
free up all allocations before we return.  "roots" is
initialized to NULL, so it should be safe to free
it unconditionally (ulist_free() handles that case).

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:29 -07:00
Qu Wenruo 51f395ad40 btrfs: Use right extent length when inserting overlap extent map.
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.

But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start:  27874 len 4100
28672		  27874		28672	27874+4100	32768
                    |-----------------------|
|---------hole--------------------|---------data----------|

1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).

2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().

3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.

This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.

This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:27 -07:00
Filipe Manana 62e2390e1a Btrfs: clone, don't create invalid hole extent map
When cloning a file that consists of an inline extent, we were creating
an extent map that represents a non-existing trailing hole starting at a
file offset that isn't a multiple of the sector size. This happened because
when processing an inline extent we weren't aligning the extent's length to
the sector size, and therefore incorrectly treating the range
[inline_extent_length; sector_size[ as a hole.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:26 -07:00
Filipe Manana 7064dd5c36 Btrfs: don't monopolize a core when evicting inode
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.

I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.

    $ mkfs.btrfs -f -O no-holes $TEST_DEV
    $ MKFS_OPTIONS="-O no-holes" ./check generic/299
    generic/299 382s ...
    Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
     kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
     384s
    Ran: generic/299
    Passed all 1 tests

    $ dmesg
    (...)
    [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
    (...)
    [86304.300036] Call Trace:
    [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
    [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
    [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
    [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
    [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
    [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
    [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
    [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
    [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
    [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
    [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
    [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
    [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
    [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
    [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
    [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:25 -07:00
Filipe Manana 74121f7cbb Btrfs: fix hole detection during file fsync
The file hole detection logic during a file fsync wasn't correct,
because it didn't look back (in a previous leaf) for the last file
extent item that can be in a leaf to the left of our leaf and that
has a generation lower than the current transaction id. This made it
assume that a hole exists when it really doesn't exist in the file.

Such false positive hole detection happens in the following scenario:

* We have a file that has many file extent items, covering 3 or more
  btree leafs (the first leaf must contain non file extent items too).

* Two ranges of the file are modified, with their extent items being
  located at 2 different leafs and those leafs aren't consecutive.

* When processing the second modified leaf, we weren't checking if
  some file extent item exists that is located in some leaf that is
  between our 2 modified leafs, and therefore assumed the range defined
  between the last file extent item in the first leaf and the first file
  extent item in the second leaf matched a hole.

Fortunately this didn't result in overriding the log with wrong data,
instead it made the last loop in copy_items() attempt to insert a
duplicated key (for a hole file extent item), which makes the file
fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
turn ends up doing a full transaction commit, which is much more
expensive then writing only to the log tree and wait for it to be
durably persisted (as well as the file's modified extents/pages).
Therefore fix the hole detection logic, so that we don't pay the
cost of doing full transaction commits.

I could trigger this issue with the following test for xfstests (which
never fails, either without or with this patch). The last fsync call
results in a full transaction commit, due to the -EEXIST error mentioned
above. I could also observe this behaviour happening frequently when
running xfstests/generic/075 in a loop.

Test:

    _cleanup()
    {
        _cleanup_flakey
        rm -fr $tmp
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_flakey
    _need_to_be_root

    rm -f $seqres.full

    # Create a file with many file extent items, each representing a 4Kb extent.
    # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
    # as of btrfs-progs 3.12).
    _scratch_mkfs -l 16384 >/dev/null 2>&1
    _init_flakey
    SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
    MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
    _mount_flakey

    # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
    $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
            $SCRATCH_MNT/foo | _filter_xfs_io

    # For any of the following fsync calls, inode doesn't have the flag
    # BTRFS_INODE_NEEDS_FULL_SYNC set.
    for ((i = 1; i <= 500; i++)); do
        OFFSET=$((4096 * i))
        LEN=4096
        $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                $SCRATCH_MNT/foo | _filter_xfs_io
    done

    # Commit transaction and bump next transaction's id (to 7).
    sync

    # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
    # inode runtime flags.
    $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo

    # Commit transaction and bump next transaction's id (to 8).
    sync

    # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
    # in the middle, containing only file extent items, isn't touched. So the
    # next fsync, when calling btrfs_search_forward(), won't visit that middle
    # leaf. First and 3rd leaf have now a generation with value 8, while the
    # middle leaf remains with a generation with value 6.
    $XFS_IO_PROG \
        -c "pwrite -S 0xee -b 4096 0 4096" \
        -c "pwrite -S 0xff -b 4096 2043904 4096" \
        -c "fsync" \
        $SCRATCH_MNT/foo | _filter_xfs_io

    _load_flakey_table $FLAKEY_DROP_WRITES
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    # During mount, we'll replay the log created by the fsync above, and the file's
    # md5 digest should be the same we got before the unmount.
    _mount_flakey
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey
    MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"

    status=0
    exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:24 -07:00
Filipe Manana 5762b5c958 Btrfs: ensure tmpfile inode is always persisted with link count of 0
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.

Steps to reproduce it (requires a modern xfs_io with -T support):

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o /dev/sdd /mnt
    $ xfs_io -T /mnt &
    $ sync

Then btrfs-debug-tree shows the inode item with a link count of 1:

    $ btrfs-debug-tree /dev/sdd
    (...)
    fs tree key (FS_TREE ROOT_ITEM 0)
    leaf 29556736 items 4 free space 15851 generation 6 owner 5
    fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
    chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
    	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
    	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
    		inode ref index 0 namelen 2 name: ..
    	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
    		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
    	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
		orphan item
    checksum tree key (CSUM_TREE ROOT_ITEM 0)
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:23 -07:00
Filipe Manana 9c3b306e1c Btrfs: race free update of commit root for ro snapshots
This is a better solution for the problem addressed in the following
commit:

    Btrfs: update commit root on snapshot creation after orphan cleanup
    (3821f34888)

The previous solution wasn't the best because of 2 reasons:

    1) It added another full transaction commit, which is more expensive
       than just swapping the commit root with the root;

    2) If a reboot happened after the first transaction commit (the one
       that creates the snapshot) and before the second transaction commit,
       then we would end up with the same problem if a send using that
       snapshot was requested before the first transaction commit after
       the reboot.

This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.

Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:21 -07:00
Liu Bo 87fa3bb078 Btrfs: fix regression of btrfs device replace
Commit 49c6f736f34f901117c20960ebd7d5e60f12fcac(
btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry
in the process of device replace, but didn't take missing devices into account,
so now we have

BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
...

To reproduce it,
1. mkfs.btrfs -f disk1 disk2
2. mkfs.ext4 disk1
3. mount disk2 /mnt -odegraded
4. btrfs replace start -B 1 disk3 /mnt
--------------------------

This fixes the problem.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:20 -07:00
Linus Torvalds 372b1dbdd1 Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Most important fixes in this set include three SMB3 fixes for stable
  (including fix for possible kernel oops), and a workaround to allow
  writes to Mac servers (only cifs dialect, not more current SMB2.1,
  worked to Mac servers).  Also fallocate support added, and lease fix
  from Jeff"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  [SMB3] Enable fallocate -z support for SMB3 mounts
  enable fallocate punch hole ("fallocate -p") for SMB3
  Incorrect error returned on setting file compressed on SMB2
  CIFS: Fix wrong directory attributes after rename
  CIFS: Fix SMB2 readdir error handling
  [CIFS] Possible null ptr deref in SMB2_tcon
  [CIFS] Workaround MacOS server problem with SMB2.1 write  response
  cifs: handle lease F_UNLCK requests properly
  Cleanup sparse file support by creating worker function for it
  Add sparse file support to SMB2/SMB3 mounts
  Add missing definitions for CIFS File System Attributes
  cifs: remove unused function cifs_oplock_break_wait
2014-08-20 18:33:21 -05:00
Chin-Tsung Cheng e6d8fb340f ext3: Count internal journal as bsddf overhead in ext3_statfs
The journal blocks of external journal device should not
be counted as overhead.

Signed-off-by: Chin-Tsung Cheng <chintzung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 23:16:51 +02:00
Jaegeuk Kim 97c3c5cac2 f2fs: don't skip checkpoint if there is no dirty node pages
This is the errorneous scenario.
1. write data
2. do checkpoint
3. produce some dirty node pages by the gc thread
4. write back dirty node pages
5. f2fs_put_super will skip the checkpoint, since dirty count for node pages is
  zero.

This patch removes such the wrong condition check.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:35 -07:00
Jaegeuk Kim b307384e4f f2fs: avoid bug_on when error is occurred
During the recovery, if an error like EIO or ENOMEM, f2fs_bug_on should skip.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:35 -07:00
Jaegeuk Kim 1c35a90e8a f2fs: fix to recover inline_xattr/data and blocks
This patch fixes not to skip xattr recovery and inline xattr/data recovery
order.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim e3b4d43f7c f2fs: should clear the inline_xattr flag
During the recovery, we should clear the inline_xattr flag if its xattr node
block is recovered.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 695facc05a f2fs: clear FI_INC_LINK during the recovery
If an inode are fsynced multiple times with fsync & dent marks, this inode will
set FI_INC_LINK at find_fsync_dnodes during the recovery.
But, in recover_inode, recover_dentry doesn't clear that flag when multiple hits
were occurred.

So this patch removes the flag for the further consistency.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 617deb8c05 f2fs: fix the initial inode page for recovery
If a new inode page is needed for recover_dentry, we should assing i_inline
as zero.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 0342fd301a f2fs: make clear on test condition and return types
This patch adds a parentheses to make clear for condition check.
And also it changes the return type for better meanings.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
Jaegeuk Kim b067ba1f1b f2fs: should convert inline_data during the mkwrite
If mkwrite is called to an inode having inline_data, it can overwrite the data
index space as NEW_ADDR. (e.g., the first 4 bytes are coincidently zero)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
arter97 e1c4204520 f2fs: fix typo
Fix typo and some grammatical errors.

The words "filesystem" and "readahead" are being used without the space treewide.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
Jan Kara 410dd3cf4c isofs: Fix unbounded recursion when processing relocated directories
We did not check relocated directory in any way when processing Rock
Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL
entry pointing to another CL entry leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).

Fix the problem by not allowing CL entry to point to a directory entry
with CL entry (such use makes no good sense anyway) and by checking
whether CL entry doesn't point to itself.

CC: stable@vger.kernel.org
Reported-by: Chris Evans <cevans@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 18:29:30 +02:00
Chao Yu 85cd083b49 udf: avoid unneeded up_write when fail to add entry in ->symlink
We have released the ->i_data_sem before invoking udf_add_entry(),
so in following error path, we should not release this lock again.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 18:29:30 +02:00
Miao Xie 95669976bd Btrfs: don't consider the missing device when allocating new chunks
The original code allocated new chunks by the number of the writable devices
and missing devices to make sure that any RAID levels on a degraded FS continue
to be honored, but it introduced a problem that it stopped us to allocating
new chunks, the steps to reproduce is following:

 # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
 # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
 # mount -o degraded <dev0> <mnt>
 # dd if=/dev/null of=<mnt>/tmpfile bs=1M

It is because we allocate new chunks only on the writable devices, if we take
the number of missing devices into account, and want to allocate new chunks
with higher RAID level, we will fail becaue we don't have enough writable
device. Fix it by ignoring the number of missing devices when allocating
new chunks.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:19 -07:00
Miao Xie 7df69d3e94 Btrfs: Fix wrong device size when we are resizing the device
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:18 -07:00
Miao Xie 5d68da3b8e Btrfs: don't write any data into a readonly device when scrub
We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:17 -07:00
Miao Xie ff61d17c63 Btrfs: Fix the problem that the replace destroys the seed filesystem
The seed filesystem was destroyed by the device replace, the reproduce
method is:
 # mkfs.btrfs -f <dev0>
 # btrfstune -S 1 <dev0>
 # mount <dev0> <mnt>
 # btrfs device add <dev1> <mnt>
 # umount <mnt>
 # mount <dev1> <mnt>
 # btrfs replace start -f <dev0> <dev2> <mnt>
 # umount <mnt>
 # mount <dev0> <mnt>

It is because we erase the super block on the seed device. It is wrong,
we should not change anything on the seed device.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:16 -07:00
Qu Wenruo 2c91943b50 btrfs: Return right extent when fiemap gives unaligned offset and len.
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.

The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:14 -07:00
Wang Shilong e2eca69dc6 Btrfs: fix wrong extent mapping for DirectIO
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.

This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.

However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.

Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:13 -07:00
Wang Shilong 9a025a0860 Btrfs: fix wrong write range for filemap_fdatawrite_range()
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:12 -07:00
Miao Xie 3a7d55c84c Btrfs: fix wrong missing device counter decrease
The missing devices are accounted by its own fs device, for example
the missing devices in seed filesystem will be accounted by the fs device
of the seed filesystem, not by the new filesystem which is based on
the seed filesystem, so when we remove the missing device in the
seed filesystem, we should decrease the counter of its own fs device.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:10 -07:00
Miao Xie 69611ac810 Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:32 -07:00
Anand Jain 77bdae4d13 btrfs: check generation as replace duplicates devid+uuid
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.

 device_list_add() function is called in the following threads,
	mount device option
	mount argument
	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
 they have been unit tested to work fine with this patch.

 If the user knows what he is doing and really want to pair with
 replaced disk (which is not a standard operation), then he should
 first clear the kernel btrfs device list in the memory by doing
 the module unload/load and followed with the mount -o device option.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:30 -07:00
Anand Jain b96de000bc Btrfs: device_list_add() should not update list when mounted
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.

Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.

In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.

Which is to note that old device is neither closed nor gracefully
removed from the btrfs.

The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.

reproducer:

devmgt[1] detach /dev/sdc

replace the missing disk /dev/sdc

btrfs rep start -f 1 /dev/sde /btrfs
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
        Total devices 2 FS bytes used 32.00KiB
        devid    1 size 958.94MiB used 115.88MiB path /dev/sde
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

make /dev/sdc to reappear

devmgt attach host2

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
        Total devices 2 FS bytes used 32.00KiB^M
        devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.

[1] github.com/anajain/devmgt.git

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:28 -07:00
chandan 1707e26d6a Btrfs: fill_holes: Fix slot number passed to hole_mergeable() call.
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot
where the key could have been present, which in this case would be the slot
containing the extent item which would be the next neighbor of the file range
being punched. The current code passes an incremented path->slots[0] and we
skip to the wrong file extent item. This would mean that we would fail to
merge the "yet to be created" hole with the next neighboring hole (if one
exists). Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:26 -07:00
Miao Xie 7a5c3c9be1 Btrfs: fix put dio bio twice when we submit dio bio fail
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:24 -07:00
Steve French 30175628bf [SMB3] Enable fallocate -z support for SMB3 mounts
fallocate -z (FALLOC_FL_ZERO_RANGE) can map to SMB3
FSCTL_SET_ZERO_DATA SMB3 FSCTL but FALLOC_FL_ZERO_RANGE
when called without the FALLOC_FL_KEEPSIZE flag set could want
the file size changed so we can not support that subcase unless
the file is cached (and thus we know the file size).

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2014-08-17 18:16:40 -05:00
Steve French 31742c5a33 enable fallocate punch hole ("fallocate -p") for SMB3
Implement FALLOC_FL_PUNCH_HOLE (which does not change the file size
fortunately so this matches the behavior of the equivalent SMB3
fsctl call) for SMB3 mounts.  This allows "fallocate -p" to work.
It requires that the server support setting files as sparse
(which Windows allows).

Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 18:12:38 -05:00
Steve French ad3829cf1d Incorrect error returned on setting file compressed on SMB2
When the server (for an SMB2 or SMB3 mount) doesn't support
an ioctl (such as setting the compressed flag
on a file) we were incorrectly returning EIO instead
of EOPNOTSUPP, this is confusing e.g. doing chattr +c to a file
on a non-btrfs Samba partition, now the error returned is more
intuitive to the user.  Also fixes error mapping on setting
hardlink to servers which don't support that.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
2014-08-17 18:12:31 -05:00
Pavel Shilovsky b46799a8f2 CIFS: Fix wrong directory attributes after rename
When we requests rename we also need to update attributes
of both source and target parent directories. Not doing it
causes generic/309 xfstest to fail on SMB2 mounts. Fix this
by marking these directories for force revalidating.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 05:08:46 -05:00
Pavel Shilovsky 52755808d4 CIFS: Fix SMB2 readdir error handling
SMB2 servers indicates the end of a directory search with
STATUS_NO_MORE_FILE error code that is not processed now.
This causes generic/257 xfstest to fail. Fix this by triggering
the end of search by this error code in SMB2_query_directory.

Also when negotiating CIFS protocol we tell the server to close
the search automatically at the end and there is no need to do
it itself. In the case of SMB2 protocol, we need to close it
explicitly - separate close directory checks for different
protocols.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 05:08:39 -05:00
Steve French 18f39e7be0 [CIFS] Possible null ptr deref in SMB2_tcon
As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
and there is one path in which tcon can be null.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org> # v3.7+
Reported-by: Raphael Geissert <geissert@debian.org>
2014-08-17 00:41:02 -05:00
Linus Torvalds e64df3ebe8 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the
  SUSE kernel, and makes our subvolume quotas dramatically more
  accurate.

  I've been running xfstests with these against your current git
  overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-16 09:06:55 -06:00
Linus Torvalds 53b95d6341 File locking related bugfixes for v3.17 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT7ey6AAoJEAAOaEEZVoIVgdkQAJfToVhgddVMOweeGo4wqPqM
 lmS35dYVEy+gPfYhCU2Zytgk9yLlmNJeDT7Y+XGe4l15Ax/PDNuWLiRnUKRy1FrU
 l7cbKRAn1L6TqBO2ang5t68Bw7ojUpRoKWKnDyjVAj80mZYRZPWvQOurLeTtra2o
 XLnHbK54F36s07OXwSTZgbh/ffVQ1RMUBU8fy+0Ws3mTAbzO1KuB5Ws4pdt2ysjI
 14pBHO553X0VXAJxGkH66xxblt1o+Q9aBZp7RL0VNtR4bGU4FMvXy+0D5G6h8AGt
 rhl2PWTNKGg8HFgUK+7nCheH4j/0maXo541D3q+sJhbknRhD3n3x7IBvjkm9tdjB
 OgTkAp1mwL21mJP21MOrAil8uwGoSr7qTCngZY6nNWH4L3EkVl27+LlGDtkgBp/n
 BJyhcPbFSh+K4TTD3dg2rEx7wF/npQ6yPOljjvWXKJEm5lx3ZPkK1l1xS3QnVTMe
 pLq+wTZ9v1cU7+9JFWICQwclv4unjIgqxLo7/op5P5KvTWOHFW4cjdwCBPVE1g3a
 WC2c5jdB0Up6Z59aXAN3p5dk8MCy6NA41lkMfJN4AUAIzb6NvNeDBhrcFaHwXowm
 jgCtEPqFN/HqsEwJmhJ7fcIEKYrOCuzeaR5tGuwJ11re/oULGXTgMkrFwgm/YWwu
 esRBAc53/hZg6oo3ipUH
 =09BX
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux

Pull file locking bugfixes from Jeff Layton:
 "Most of these patches are to fix a long-standing regression that crept
  in when the BKL was removed from the file-locking code.  The code was
  converted to use a conventional spinlock, but some fl_release_private
  ops can block and you can end up sleeping inside the lock.

  There's also a patch to make /proc/locks show delegations as 'DELEG'"

* tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux:
  locks: update Locking documentation to clarify fl_release_private behavior
  locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
  locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
  locks: don't reuse file_lock in __posix_lock_file
  locks: don't call locks_release_private from locks_copy_lock
  locks: show delegations as "DELEG" in /proc/locks
2014-08-16 08:58:47 -06:00
Linus Torvalds da06df548e Merge git://git.kvack.org/~bcrl/aio-next
Pull aio updates from Ben LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: use iovec array rather than the single one
  aio: fix some comments
  aio: use the macro rather than the inline magic number
  aio: remove the needless registration of ring file's private_data
  aio: remove no longer needed preempt_disable()
  aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
  aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
2014-08-16 08:56:27 -06:00
Steve French 754789a1c0 [CIFS] Workaround MacOS server problem with SMB2.1 write
response

Writes fail to Mac servers with SMB2.1 mounts (works with cifs though) due
to them sending an incorrect RFC1001 length for the SMB2.1 Write response.
Workaround this problem. MacOS server sends a write response with 3 bytes
of pad beyond the end of the SMB itself.  The RFC1001 length is 3 bytes
more than the sum of the SMB2.1 header length + the write reponse.

Incorporate feedback from Jeff and JRA to allow servers to send
a tcp frame that is even more than three bytes too long
(ie much longer than the SMB2/SMB3 request that it contains) but
we do log it once now. In the earlier version of the patch I had
limited how far off the length field could be before we fail the request.

Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-15 23:49:01 -05:00
Jeff Layton 024408062b cifs: handle lease F_UNLCK requests properly
Currently any F_UNLCK request for a lease just gets back -EAGAIN. Allow
them to go immediately to generic_setlease instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-15 23:01:52 -05:00
Steve French d43cc79343 Cleanup sparse file support by creating worker function for it
Simply move code to new function (for clarity). Function sets or clears
the sparse file attribute flag.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
2014-08-15 23:01:00 -05:00
Chris Mason 8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Filipe Manana 27b9a8122f Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:40 -07:00
Takashi Iwai 4eb1f66dce Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
 BUG: unable to handle kernel NULL pointer dereference at 00000004
 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
 *pde = 00000000
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
 task: f1478130 ti: f147c000 task.ti: f147c000
 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
 EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
 EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
 Stack:
  00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
  00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
  00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
 Call Trace:
  [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
  [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
  [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
  [<c025e38b>] process_one_work+0x11b/0x390
  [<c025eea1>] worker_thread+0x101/0x340
  [<c026432b>] kthread+0x9b/0xb0
  [<c0712a71>] ret_from_kernel_thread+0x21/0x30
  [<c0264290>] kthread_create_on_node+0x110/0x110

This indicates a NULL corruption in prefs_delayed list.  The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.

ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux.  The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux.  That is, the function overwrites
64bit value on 32bit pointer.  This caused a NULL in the adjacent
variable, in this case, prefs_delayed.

Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64.  There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly.  But, it's
safer than explicit cast to u64, anyway.

Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Cc: <stable@vger.kernel.org> [v3.11+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:19 -07:00
Liu Bo ce62003f69 Btrfs: fix compressed write corruption on enospc
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:18 -07:00
Mark Fasheh f90e579c2b btrfs: correctly handle return from ulist_add
ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
doesn't take into account. As a result, that value can be bubbled up to
callers, causing an error to be printed. Fix this by only returning the
value of ulist_add() when it indicates an error.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:16 -07:00
Mark Fasheh 1152651a08 btrfs: qgroup: account shared subtrees during snapshot delete
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:14 -07:00
Filipe Manana 6f7ff6d783 Btrfs: read lock extent buffer while walking backrefs
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:13 -07:00
Josef Bacik e339a6b097 Btrfs: __btrfs_mod_ref should always use no_quota
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:11 -07:00
David Sterba ba7b6e62f4 btrfs: adjust statfs calculations according to raid profiles
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528

and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536

Works fine for "clean" raid profiles where the raid factor correction
does the right job. Otherwise it's pessimistic and may show low space
although there's still some left.

The df nubmers are lightly wrong in case of mixed block groups, but this
is not a major usecase and can be addressed later.

The RAID56 numbers are wrong almost the same way as before and will be
addressed separately.

CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:10 -07:00
Jeff Layton 2dfb928f7e locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
There's no need to call locks_free_lock here while still holding the
i_lock. Defer that until the lock has been dropped.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Jeff Layton ed9814d858 locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
In commit 72f98e7255 (locks: turn lock_flocks into a spinlock), we
moved from using the BKL to a global spinlock. With this change, we lost
the ability to block in the fl_release_private operation.

This is problematic for NFS (and probably some other filesystems as
well). Add a new list_head argument to locks_delete_lock. If that
argument is non-NULL, then queue any locks that we want to free to the
list instead of freeing them.

Then, add a new locks_dispose_list function that will walk such a list
and call locks_free_lock on them after the i_lock has been dropped.

Finally, change all of the callers of locks_delete_lock to pass in a
list_head, except for lease_modify. That function can be called long
after the i_lock has been acquired. Deferring the freeing of a lease
after unlocking it in that function is non-trivial until we overhaul
some of the spinlocking in the lease code.

Currently though, no filesystem that sets fl_release_private supports
leases, so this is not currently a problem. We'll eventually want to
make the same change in the lease code, but it needs a lot more work
before we can reasonably do so.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Jeff Layton b84d49f944 locks: don't reuse file_lock in __posix_lock_file
Currently in the case where a new file lock completely replaces the old
one, we end up overwriting the existing lock with the new info. This
means that we have to call fl_release_private inside i_lock. Change the
code to instead copy the info to new_fl, insert that lock into the
correct spot and then delete the old lock. In a later patch, we'll defer
the freeing of the old lock until after the i_lock has been dropped.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Linus Torvalds 06b8ab5528 NFS client updates for Linux 3.17
Highlights include:
 
 - Stable fix for a bug in nfs3_list_one_acl()
 - Speed up NFS path walks by supporting LOOKUP_RCU
 - More read/write code cleanups
 - pNFS fixes for layout return on close
 - Fixes for the RCU handling in the rpcsec_gss code
 - More NFS/RDMA fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
 TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
 mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
 RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
 nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
 OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
 f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
 gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
 t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
 VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
 NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
 ivr2s7aIDV48jhnwBSVL
 =KE07
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for a bug in nfs3_list_one_acl()
   - speed up NFS path walks by supporting LOOKUP_RCU
   - more read/write code cleanups
   - pNFS fixes for layout return on close
   - fixes for the RCU handling in the rpcsec_gss code
   - more NFS/RDMA fixes"

* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  nfs: reject changes to resvport and sharecache during remount
  NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
  SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
  NFS: fix two problems in lookup_revalidate in RCU-walk
  NFS: allow lockless access to access_cache
  NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
  NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
  NFS: support RCU_WALK in nfs_permission()
  sunrpc/auth: allow lockless (rcu) lookup of credential cache.
  NFS: prepare for RCU-walk support but pushing tests later in code.
  NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
  NFS: add checks for returned value of try_module_get()
  nfs: clear_request_commit while holding i_lock
  pnfs: add pnfs_put_lseg_async
  pnfs: find swapped pages on pnfs commit lists too
  nfs: fix comment and add warn_on for PG_INODE_REF
  nfs: check wait_on_bit_lock err in page_group_lock
  sunrpc: remove "ec" argument from encrypt_v2 operation
  sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
  sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
  ...
2014-08-13 18:13:19 -06:00
Linus Torvalds dc1cc85133 xfs: update for 3.17-rc1
This update contains:
 o conversion of the XFS core to pass negative error numbers
 o restructing of core XFS code that is shared with userspace to fs/xfs/libxfs
 o introduction of sysfs interface for XFS
 o bulkstat refactoring
 o demand driven speculative preallocation removal
 o XFS now always requires 64 bit sectors to be configured
 o metadata verifier changes to ensure CRCs are calculated during log recovery
 o various minor code cleanups
 o miscellaneous bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJT6dJvAAoJEK3oKUf0dfodYOYP/jhWGv29OrX958WC+U1kCItb
 WmNV6dSENWgtgYRgdtcxa5ZE0HeEndvN63LOCzAB9VegZK7XjpLob9J2f07gMvP1
 u4vv9LvgLdky/tYxttBYHuyen0hQO0r+esRg/xrawpklKQ3Efofi4MUSbb5ZBDzP
 QvVeNhIDI04A0CoDngWDkV1PpbdwDUjVZFZon36XVDTcSCf/h2B+nekjOVJQpEDC
 qJ9nZWRgAcm4IzZZBqwGt0LJ62Z+Ww8jksevWEnjcXm1xfsltL8gf8o6WOX5iXDR
 PSeoJwVRAxLfiQxnQSENEOeLvKWVXNyC/B10w1H20qZkvQJMsX/Fq0MraBL0Ediu
 0qyZgJsOyoOAvJYXWfUyykUe03dF5/cgrB+RpOLmp4B9lCugCEsQHtjRXEHE1EK2
 +4sHc13I7+SLIAlgySmJdrLe8Kq1vamo0/WD2wH+pdOvNmQ7HJzBUFEj1D83b55a
 DWfBjbz/N5lJW82Vfek2coURJGi/1B87koAtODXWfeM+nbBs47z0dv31G235Boq3
 +Qy2qQdCv05xV+CZn/RAmaudAFFnJEV8L4ADpwqSGfVM72ug0KQF8M5DblG1Q8sb
 xRAQq+krG7K4ui1kPNKcxoHoUIYQUJm0p1aU4V2OAQlikYCCaSgRYGF2xzSVC7j9
 uZd4e4bllsBdPF/Se+8W
 =EUjq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Dave Chinner:
 "This update contains:
   - conversion of the XFS core to pass negative error numbers
   - restructing of core XFS code that is shared with userspace to
     fs/xfs/libxfs
   - introduction of sysfs interface for XFS
   - bulkstat refactoring
   - demand driven speculative preallocation removal
   - XFS now always requires 64 bit sectors to be configured
   - metadata verifier changes to ensure CRCs are calculated during log
     recovery
   - various minor code cleanups
   - miscellaneous bug fixes

  The diffstat is kind of noisy because of the restructuring of the code
  to make kernel/userspace code sharing simpler, along with the XFS wide
  change to use the standard negative error return convention (at last!)"

* tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs: (45 commits)
  xfs: fix coccinelle warnings
  xfs: flush both inodes in xfs_swap_extents
  xfs: fix swapext ilock deadlock
  xfs: kill xfs_vnode.h
  xfs: kill VN_MAPPED
  xfs: kill VN_CACHED
  xfs: kill VN_DIRTY()
  xfs: dquot recovery needs verifiers
  xfs: quotacheck leaves dquot buffers without verifiers
  xfs: ensure verifiers are attached to recovered buffers
  xfs: catch buffers written without verifiers attached
  xfs: avoid false quotacheck after unclean shutdown
  xfs: fix rounding error of fiemap length parameter
  xfs: introduce xfs_bulkstat_ag_ichunk
  xfs: require 64-bit sector_t
  xfs: fix uflags detection at xfs_fs_rm_xquota
  xfs: remove XFS_IS_OQUOTA_ON macros
  xfs: tidy up xfs_set_inode32
  xfs: allow inode allocations in post-growfs disk space
  xfs: mark xfs_qm_quotacheck as static
  ...
2014-08-13 17:49:53 -06:00
Linus Torvalds cec997093b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, UDF updates from Jan Kara:
 "Scalability improvements for quota, a few reiserfs fixes, and couple
  of misc cleanups (udf, ext2)"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Fix use after free in journal teardown
  reiserfs: fix corruption introduced by balance_leaf refactor
  udf: avoid redundant memcpy when writing data in ICB
  fs/udf: re-use hex_asc_upper_{hi,lo} macros
  fs/quota: kernel-doc warning fixes
  udf: use linux/uaccess.h
  fs/ext2/super.c: Drop memory allocation cast
  quota: remove dqptr_sem
  quota: simplify remove_inode_dquot_ref()
  quota: avoid unnecessary dqget()/dqput() calls
  quota: protect Q_GETFMT by dqonoff_mutex
2014-08-13 17:45:40 -06:00
Linus Torvalds 8d2d441ac4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is a lot of refactoring and hardening of the libceph and rbd
  code here from Ilya that fix various smaller bugs, and a few more
  important fixes with clone overlap.  The main fix is a critical change
  to the request_fn handling to not sleep that was exposed by the recent
  mutex changes (which will also go to the 3.16 stable series).

  Yan Zheng has several fixes in here for CephFS fixing ACL handling,
  time stamps, and request resends when the MDS restarts.

  Finally, there are a few cleanups from Himangi Saraogi based on
  Coccinelle"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
  libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
  rbd: remove extra newlines from rbd_warn() messages
  rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
  rbd: rework rbd_request_fn()
  ceph: fix kick_requests()
  ceph: fix append mode write
  ceph: fix sizeof(struct tYpO *) typo
  ceph: remove redundant memset(0)
  rbd: take snap_id into account when reading in parent info
  rbd: do not read in parent info before snap context
  rbd: update mapping size only on refresh
  rbd: harden rbd_dev_refresh() and callers a bit
  rbd: split rbd_dev_spec_update() into two functions
  rbd: remove unnecessary asserts in rbd_dev_image_probe()
  rbd: introduce rbd_dev_header_info()
  rbd: show the entire chain of parent images
  ceph: replace comma with a semicolon
  rbd: use rbd_segment_name_free() instead of kfree()
  ceph: check zero length in ceph_sync_read()
  ceph: reset r_resend_mds after receiving -ESTALE
  ...
2014-08-13 17:43:29 -06:00
Linus Torvalds 89838b80bb No significant changes, mostly small fixes here and there. The more important
fixes are:
 
 * UBI deleted list items while iterating the list with 'list_for_each_entry'
 * The UBI block driver did not work properly with very large UBI volumes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT6IALAAoJECmIfjd9wqK0yRwP/R4XpNyWZOrK72fvb3M6ucK0
 SK3pu6FNfvOSYibmhOtZWjFbiMkgUd7o8jy+4uJa6PIMRQbKGrH/xvujPzON6GQT
 ZZxeJ7n1O8IAZL/dibqFpEv3NQFjgz1IaLnpktQkkp6rWsXCBYNbvFIywWYgkndt
 1w11QG/TAhkJcA/0Xy/+zmkHeXjOQrVeVMbBN2MMVekWg8JgMNTd9mkWGa1A1oGI
 nuq30nIkUP7LX9T/olFGhjFIeeb/bkmWgpzS4K9PI+hOGGCZvIWj/pu8Jj8DAR54
 UGA7qtMkTjxldvHPnuACsSKNS/N7UGY64kTwcucEcJaN2MXmaBFP8ipI/wtb9bCV
 fUoBt+p470E7iDoQymbM5zNw/HPNh4XWVd2X1oYVftmO70PZ6sWeOKWC+tJn/Aj8
 xsrgd1PcWmuyu399EFW/Il5ItOqfYsNIkVFNzIb1O6Pd8Ylt5eYtuyoPXk5aRA4p
 xZ3ohO3OihvWXwwtCUjforwPy37iaX8vwnuI9xdgnEisTHJWR6PYITvcrwqAIH44
 6PEqINU4zwGTGxOTOWKZxdtPVIuChuwqDyY+7+xQD+LIPQ1/BodozExGiQ0lhAmA
 DDC2Te8q5lpZTz03E8ot9zZHmjQoG+uo10DhEzT5U1ycv9p7dISGfbVucyuKDTw9
 JdUnI8cZ2k6mZi8q6dfA
 =zgqa
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS changes from Artem Bityutskiy:
 "No significant changes, mostly small fixes here and there.  The more
  important fixes are:

   - UBI deleted list items while iterating the list with
     'list_for_each_entry'
   - The UBI block driver did not work properly with very large UBI
     volumes"

* tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  UBIFS: Add log overlap assertions
  Revert "UBIFS: add a log overlap assertion"
  UBI: bugfix in ubi_wl_flush()
  UBI: block: Avoid disk size integer overflow
  UBI: block: Set disk_capacity out of the mutex
  UBI: block: Make ubiblock_resize return something
  UBIFS: add a log overlap assertion
  UBIFS: remove unnecessary check
  UBIFS: remove mst_mutex
  UBIFS: kernel-doc warning fix
  UBI: init_volumes: Ignore volumes with no LEBs
  UBIFS: replace seq_printf by seq_puts
  UBIFS: replace count*size kzalloc by kcalloc
  UBIFS: kernel-doc warning fix
  UBIFS: fix error path in create_default_filesystem()
  UBIFS: fix spelling of "scanned"
  UBIFS: fix some comments
  UBIFS: remove useless @ecc in struct ubifs_scan_leb
  UBIFS: remove useless statements
  UBIFS: Add missing break statements in dbg_chk_pnode()
  ...
2014-08-13 17:42:11 -06:00
Steve French 3d1a3745d8 Add sparse file support to SMB2/SMB3 mounts
Many Linux filesystes make a file "sparse" when extending
a file with ftruncate. This does work for CIFS to Samba
(only) but not for SMB2/SMB3 (to Samba or Windows) since
there is a "set sparse" fsctl which is supposed to be
sent to mark a file as sparse.

This patch marks a file as sparse by sending this simple
set sparse fsctl if it is extended more than 2 pages.
It has been tested to Windows 8.1, Samba and various
SMB2/SMB3 servers which do support setting sparse (and
MacOS which does not appear to support the fsctl yet).
If a server share does not support setting a file
as sparse, then we do not retry setting sparse on that
share.

The disk space savings for sparse files can be quite
large (even more significant on Windows servers than Samba).

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
2014-08-13 13:18:35 -05:00
Steve French 8ae31240cc Add missing definitions for CIFS File System Attributes
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <spargaonkar@suse.com>
2014-08-12 23:47:14 -05:00
Jan Kara 01777836c8 reiserfs: Fix use after free in journal teardown
If do_journal_release() races with do_journal_end() which requeues
delayed works for transaction flushing, we can leave work items for
flushing outstanding transactions queued while freeing them. That
results in use after free and possible crash in run_timers_softirq().

Fix the problem by not requeueing works if superblock is being shut down
(MS_ACTIVE not set) and using cancel_delayed_work_sync() in
do_journal_release().

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-12 12:46:30 +02:00
Linus Torvalds f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
Jeff Layton 566709bd62 locks: don't call locks_release_private from locks_copy_lock
All callers of locks_copy_lock pass in a brand new file_lock struct, so
there's no need to call locks_release_private on it. Replace that with
a warning that fires in the event that we receive a target lock that
doesn't look like it's properly initialized.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-11 14:24:22 -04:00
Jeff Layton 8144f1f699 locks: show delegations as "DELEG" in /proc/locks
Now that they are a distinct lease type, show them as such.

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-11 13:36:54 -04:00
Al Viro 12a5b5294c fix copy_tree() regression
Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
by C), copy_tree() of A would make a copy of D' shadow the the copy of
C, not the other way around.

It's easy to fix, fortunately - just make sure that mount follows
the one that shadows it in mnt_child as well as in mnt_hash, and when
copy_tree() decides to attach a new mount, check if the last child
it has added to the same parent should be shadowing the new one.
And if it should, just use the same logics commit_tree() has - put the
new mount into the hash and children lists right after the one that
should shadow it.

Cc: stable@vger.kernel.org [3.14 and later]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-11 12:28:10 -04:00
Vincent Stehlé e91259f3c7 cifs: remove unused function cifs_oplock_break_wait
Commit 743162013d ("sched: Remove proliferation of wait_on_bit() action
functions") has removed the call to cifs_oplock_break_wait, making this
function unused; remove it.

This fixes the following compilation warning:

  fs/cifs/misc.c:578:1: warning: ‘cifs_oplock_break_wait’ defined but not used [-Wunused-function]

Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-11 01:31:03 -05:00
Linus Torvalds 155134fef2 Revert "proc: Point /proc/{mounts,net} at /proc/thread-self/{mounts,net} instead of /proc/self/{mounts,net}"
This reverts commits 344470cac4 and e813244072.

It turns out that the exact path in the symlink matters, if for somewhat
unfortunate reasons: some apparmor configurations don't allow dhclient
access to the per-thread /proc files.  As reported by Jörg Otte:

  audit: type=1400 audit(1407684227.003:28): apparmor="DENIED"
    operation="open" profile="/sbin/dhclient"
    name="/proc/1540/task/1540/net/dev" pid=1540 comm="dhclient"
    requested_mask="r" denied_mask="r" fsuid=0 ouid=0

so we had better revert this for now.  We might be able to work around
this in practice by only using the per-thread symlinks if the thread
isn't the thread group leader, and if the namespaces differ between
threads (which basically never happens).

We'll see. In the meantime, the revert was made to be intentionally easy.

Reported-by: Jörg Otte <jrg.otte@gmail.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-10 21:24:59 -07:00
Linus Torvalds 77e40aae76 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a bunch of small changes built against 3.16-rc6.  The most
  significant change for users is the first patch which makes setns
  drmatically faster by removing unneded rcu handling.

  The next chunk of changes are so that "mount -o remount,.." will not
  allow the user namespace root to drop flags on a mount set by the
  system wide root.  Aks this forces read-only mounts to stay read-only,
  no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
  mounts to stay no exec and it prevents unprivileged users from messing
  with a mounts atime settings.  I have included my test case as the
  last patch in this series so people performing backports can verify
  this change works correctly.

  The next change fixes a bug in NFS that was discovered while auditing
  nsproxy users for the first optimization.  Today you can oops the
  kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
  with pid namespaces.  I rebased and fixed the build of the
  !CONFIG_NFS_FS case yesterday when a build bot caught my typo.  Given
  that no one to my knowledge bases anything on my tree fixing the typo
  in place seems more responsible that requiring a typo-fix to be
  backported as well.

  The last change is a small semantic cleanup introducing
  /proc/thread-self and pointing /proc/mounts and /proc/net at it.  This
  prevents several kinds of problemantic corner cases.  It is a
  user-visible change so it has a minute chance of causing regressions
  so the change to /proc/mounts and /proc/net are individual one line
  commits that can be trivially reverted.  Unfortunately I lost and
  could not find the email of the original reporter so he is not
  credited.  From at least one perspective this change to /proc/net is a
  refgression fix to allow pthread /proc/net uses that were broken by
  the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
  proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
  proc: Implement /proc/thread-self to point at the directory of the current thread
  proc: Have net show up under /proc/<tgid>/task/<tid>
  NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
  mnt: Add tests for unprivileged remount cases that have found to be faulty
  mnt: Change the default remount atime from relatime to the existing value
  mnt: Correct permission checks in do_remount
  mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  mnt: Only change user settable mount flags in remount
  namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-09 17:10:41 -07:00
Linus Torvalds 0d10c2c170 Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "This includes a major rewrite of the NFSv4 state code, which has
  always depended on a single mutex.  As an example, open creates are no
  longer serialized, fixing a performance regression on NFSv3->NFSv4
  upgrades.  Thanks to Jeff, Trond, and Benny, and to Christoph for
  review.

  Also some RDMA fixes from Chuck Lever and Steve Wise, and
  miscellaneous fixes from Kinglong Mee and others"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
  svcrdma: remove rdma_create_qp() failure recovery logic
  nfsd: add some comments to the nfsd4 object definitions
  nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
  nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
  nfsd: remove nfs4_lock_state: nfs4_laundromat
  nfsd: Remove nfs4_lock_state(): reclaim_complete()
  nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
  nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
  nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
  nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
  nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
  nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
  nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
  nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
  nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
  nfsd: remove old fault injection infrastructure
  nfsd: add more granular locking to *_delegations fault injectors
  nfsd: add more granular locking to forget_openowners fault injector
  nfsd: add more granular locking to forget_locks fault injector
  nfsd: add a list_head arg to nfsd_foreach_client_lock
  ...
2014-08-09 14:31:18 -07:00
Linus Torvalds 023f78b02c Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS updates from Steve French:
 "The most visible change in this set is the additional of multi-credit
  support for SMB2/SMB3 which dramatically improves the large file i/o
  performance for these dialects and significantly increases the maximum
  i/o size used on the wire for SMB2/SMB3.

  Also reconnection behavior after network failure is improved"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (35 commits)
  Add worker function to set allocation size
  [CIFS] Fix incorrect hex vs. decimal in some debug print statements
  update CIFS TODO list
  Add Pavel to contributor list in cifs AUTHORS file
  Update cifs version
  CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
  CIFS: Optimize readpages in a short read case on reconnects
  CIFS: Optimize cifs_user_read() in a short read case on reconnects
  CIFS: Improve indentation in cifs_user_read()
  CIFS: Fix possible buffer corruption in cifs_user_read()
  CIFS: Count got bytes in read_into_pages()
  CIFS: Use separate var for the number of bytes got in async read
  CIFS: Indicate reconnect with ECONNABORTED error code
  CIFS: Use multicredits for SMB 2.1/3 reads
  CIFS: Fix rsize usage for sync read
  CIFS: Fix rsize usage in user read
  CIFS: Separate page reading from user read
  CIFS: Fix rsize usage in readpages
  CIFS: Separate page search from readpages
  CIFS: Use multicredits for SMB 2.1/3 writes
  ...
2014-08-09 13:03:34 -07:00
Linus Torvalds c309bfa9b4 MTD updates for 3.17-rc1
AMD-compatible CFI driver:
  - Support OTP programming for Micron M29EW family
  - Increase buffer write timeout, according to detected flash parameter info
 
 NAND
  - Add helpers for retrieving ONFI timing modes
  - GPMI: provide option to disable bad block marker swapping (required for
      Ka-On electronics platforms)
 
 SPI NOR
  - EON EN25QH128 support
  - Support new Flag Status Register (FSR) on a few Micron flash
 
 Common
  - New sysfs entries for bad block and ECC stats
 
 And a few miscellaneous refactorings, cleanups, and driver improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT5WCXAAoJEFySrpd9RFgt0rwP/1anAulAcve/QzVF9LDFPec8
 jSvK8WWFcHLVb9EvTHtUjRz2RSRNhe0eeyEld3WpdKZZ73VVVaHnGdJv8J8Ys8jn
 kfNBDfgDLFrVzycYCqjQ2gdvidCyrjQgtPP0E/Q/RN6FBur0/rp2WKoJ2FvuT6SS
 kOz5f3TOe+iNtxQoJwkFvs/IjfFMThGs+YMJ8Z9s4LcJHD65T6hF+zDwl8xF2xfG
 b104PsG3I58kJdYjKhRQ2/ol+YCPoVhQorhhuaqeouZum/Hb/2g3rKHVZpAv2n6m
 JWnTpbdJDqGoPFVPyJr5Vm/UYOwxEBSWimuNp+2WN7EsXux1x9JZZl5+ZNUMmb4q
 vxYhIDul2+Sg1lN+ruBe+xi6d8DI8Y5WIc9xJgn3YHLC8YSkiZ11bhQyyeHA9i5h
 jZYKSkN/ERl8iAA4ULD6tsZv4ds8LVI/XOxrcSM7myQ4p8oY5QBxEWEuGPgyH6A6
 qCVkc0TAriSPfcCBvs4o8s2uNUocq7x6ZT1xfdlJ0KVstCmeEJBJnBYwWIXR3tu7
 3+I/bI41Q29ZGV8x5PEGDSgLVDNAJZnfGPuCwdMWuVD7UQuEEOJkCvy8o7C2Fold
 hRXh3SlUICCGDzd0JdNmdt5hPuB0tzsG0YkRoRj2sS30TlHi77nN/m3HYi/JQ4UW
 gA21laizfJ+z7g/5Kabb
 =Wkmz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "AMD-compatible CFI driver:
   - Support OTP programming for Micron M29EW family
   - Increase buffer write timeout, according to detected flash
     parameter info

  NAND
   - Add helpers for retrieving ONFI timing modes
   - GPMI: provide option to disable bad block marker swapping (required
     for Ka-On electronics platforms)

  SPI NOR
   - EON EN25QH128 support
   - Support new Flag Status Register (FSR) on a few Micron flash

  Common
   - New sysfs entries for bad block and ECC stats

  And a few miscellaneous refactorings, cleanups, and driver
  improvements"

* tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd: (31 commits)
  mtd: gpmi: make blockmark swapping optional
  mtd: gpmi: remove line breaks from error messages and improve wording
  mtd: gpmi: remove useless (void *) type casts and spaces between type casts and variables
  mtd: atmel_nand: NFC: support multiple interrupt handling
  mtd: atmel_nand: implement the nfc_device_ready() by checking the R/B bit
  mtd: atmel_nand: add NFC status error check
  mtd: atmel_nand: make ecc parameters same as definition
  mtd: nand: add ONFI timing mode to nand_timings converter
  mtd: nand: define struct nand_timings
  mtd: cfi_cmdset_0002: fix do_write_buffer() timeout error
  mtd: denali: use 8 bytes for READID command
  mtd/ftl: fix the double free of the buffers allocated in build_maps()
  mtd: phram: Fix whitespace issues
  mtd: spi-nor: add support for EON EN25QH128
  mtd: cfi_cmdset_0002: Add support for locking OTP memory
  mtd: cfi_cmdset_0002: Add support for writing OTP memory
  mtd: cfi_cmdset_0002: Invalidate cache after entering/exiting OTP memory
  mtd: cfi_cmdset_0002: Add support for reading OTP
  mtd: spi-nor: add support for flag status register on Micron chips
  mtd: Account for BBT blocks when a partition is being allocated
  ...
2014-08-08 18:13:21 -07:00
David Herrmann 40e041a2c8 shm: add sealing API
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement.  However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies.  Look at the following two
use-cases:

  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy, anymore. The
     source may have put sensible data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING.  If you seal a file, a
specific set of operations is blocked on that file forever.  Unlike locks,
seals can only be set, never removed.  Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEAL operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
David Herrmann 4bb5f5d939 mm: allow drivers to prevent new writable mappings
This patch (of 6):

The i_mmap_writable field counts existing writable mappings of an
address_space.  To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set.  In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers.  This is the same
that we do for i_writecount.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
Fabian Frederick e0d9bf4cc0 fs/dlm/debug_fs.c: remove unnecessary null test before debugfs_remove
This fixes checkpatch warning:

  WARNING: debugfs_remove(NULL) is safe this check is probably not required

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:27 -07:00