Commit Graph

34219 Commits

Author SHA1 Message Date
Steve French b1d9335642 setfacl removes part of ACL when setting POSIX ACLs to Samba
setfacl over cifs mounts can remove the default ACL when setting the
(non-default part of) the ACL and vice versa (we were leaving at 0
rather than setting to -1 the count field for the unaffected
half of the ACL.  For example notice the setfacl removed
the default ACL in this sequence:

steven@steven-GA-970A-DS3:~/cifs-2.6$ getfacl /mnt/test-dir ; setfacl
-m default:user:test:rwx,user:test:rwx /mnt/test-dir
getfacl: Removing leading '/' from absolute path names
user::rwx
group::r-x
other::r-x
default:user::rwx
default:user:test:rwx
default:group::r-x
default😷:rwx
default:other::r-x

steven@steven-GA-970A-DS3:~/cifs-2.6$ getfacl /mnt/test-dir
getfacl: Removing leading '/' from absolute path names
user::rwx
user:test:rwx
group::r-x
mask::rwx
other::r-x

CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Jeremy Allison <jra@samba.org>
2013-11-15 20:50:58 -06:00
Linus Torvalds 9073e1a804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual earth-shaking, news-breaking, rocket science pile from
  trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
  doc: add missing files to timers/00-INDEX
  timekeeping: Fix some trivial typos in comments
  mm: Fix some trivial typos in comments
  irq: Fix some trivial typos in comments
  NUMA: fix typos in Kconfig help text
  mm: update 00-INDEX
  doc: Documentation/DMA-attributes.txt fix typo
  DRM: comment: `halve' -> `half'
  Docs: Kconfig: `devlopers' -> `developers'
  doc: typo on word accounting in kprobes.c in mutliple architectures
  treewide: fix "usefull" typo
  treewide: fix "distingush" typo
  mm/Kconfig: Grammar s/an/a/
  kexec: Typo s/the/then/
  Documentation/kvm: Update cpuid documentation for steal time and pv eoi
  treewide: Fix common typo in "identify"
  __page_to_pfn: Fix typo in comment
  Correct some typos for word frequency
  clk: fixed-factor: Fix a trivial typo
  ...
2013-11-15 16:47:22 -08:00
Steve French de9f68df67 [CIFS] Set copychunk defaults
Patch 2 of the copy chunk series (the final patch will
use these to handle copies of files larger than the chunk size.

We set the same defaults that Windows and Samba expect for
CopyChunk.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
2013-11-15 15:27:22 -06:00
Christoph Hellwig 8c2fabc654 nfs: fix pnfs Kconfig defaults
Defaulting to m seem to prevent building the pnfs layout modules into the
kernel.  Default to the value of CONFIG_NFS_V4 make sure they are
built in for built-in NFSv4 support and modular for a modular NFSv4.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-15 13:41:43 -05:00
NeilBrown 9e08ef1afb NFS: correctly report misuse of "migration" mount option.
The current test on valid use of the "migration" mount option can never
report an error as it will only do so if
    mnt->version !=4 && mnt->minor_version != 0
(and some other condition), but if that test would succeed, then the previous
test has already gone-to  out_minorversion_mismatch.

So change the && to an || to get correct semantics.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-15 13:41:43 -05:00
Al Viro 54563d41a5 btrfs: get rid of fdentry()
3 of 4 callers actually want file_inode()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-15 09:18:14 -05:00
Chris Mason 46e0f66a0c btrfs: fix empty_zero_page misusage
Heiko Carstens noticed that btrfs was using empty_zero_page
incorrectly.  He explained:

	The definition of empty_zero_page is architecture specific.  It
	is (currently) either a character array, an unsigned long
	containing the address of the empty_zero_page, or even worse
	only the address of the struct page belonging to the
	empty_zero_page.

This commit changes btrfs to use a for-loop instead.  On x86
the resulting .ko is smaller, and we're no longer worrying about
how each arch builds its zeros.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-15 09:17:47 -05:00
Linus Torvalds d8fe4acc88 Merge branch 'akpm' (patch-bomb from Andrew Morton)
Merge patches from Andrew Morton:
 - memstick fixes

 - the rest of MM

 - various misc bits that were awaiting merges from linux-next into
   mainline: seq_file, printk, rtc, completions, w1, softirqs, llist,
   kfifo, hfsplus

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (72 commits)
  cmdline-parser: fix build
  hfsplus: Fix undefined __divdi3 in hfsplus_init_header_node()
  kfifo API type safety
  kfifo: kfifo_copy_{to,from}_user: fix copied bytes calculation
  sound/core/memalloc.c: use gen_pool_dma_alloc() to allocate iram buffer
  llists-move-llist_reverse_order-from-raid5-to-llistc-fix
  llists: move llist_reverse_order from raid5 to llist.c
  kernel: fix generic_exec_single indentation
  kernel-provide-a-__smp_call_function_single-stub-for-config_smp-fix
  kernel: provide a __smp_call_function_single stub for !CONFIG_SMP
  kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS
  revert "softirq: Add support for triggering softirq work on softirqs"
  drivers/w1/masters/w1-gpio.c: use dev_get_platdata()
  sched: remove INIT_COMPLETION
  tree-wide: use reinit_completion instead of INIT_COMPLETION
  sched: replace INIT_COMPLETION with reinit_completion
  drivers/rtc/rtc-hid-sensor-time.c: enable HID input processing early
  drivers/rtc/rtc-hid-sensor-time.c: use dev_get_platdata()
  vsprintf: ignore %n again
  seq_file: remove "%n" usage from seq_file users
  ...
2013-11-15 09:32:31 +09:00
Geert Uytterhoeven a99b7069aa hfsplus: Fix undefined __divdi3 in hfsplus_init_header_node()
ERROR: "__divdi3" [fs/hfsplus/hfsplus.ko] undefined!

Introduced by commit 099e9245e0 ("hfsplus: implement attributes file's
header node initialization code").

i_size_read() returns loff_t, which is long long, i.e.  64-bit.  node_size
is size_t, which is either 32-bit or 64-bit.  Hence
"i_size_read(attr_file) / node_size" is a 64-by-32 or 64-by-64 division,
causing (some versions of) gcc to emit a call to __divdi3().

Fortunately node_size is actually 16-bit, as the sole caller of
hfsplus_init_header_node() passes a u16.  Hence change its type from
size_t to u16, and use do_div() to perform a 64-by-32 division.

Not seen in m68k/allmodconfig in -next, so it really depends on the
verion of gcc.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:23 +09:00
Wolfram Sang 16735d022f tree-wide: use reinit_completion instead of INIT_COMPLETION
Use this new function to make code more comprehensible, since we are
reinitialzing the completion, not initializing.

[akpm@linux-foundation.org: linux-next resyncs]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org> (personally at LCE13)
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:21 +09:00
Tetsuo Handa 652586df95 seq_file: remove "%n" usage from seq_file users
All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Tetsuo Handa 839cc2a94c seq_file: introduce seq_setwidth() and seq_pad()
There are several users who want to know bytes written by seq_*() for
alignment purpose.  Currently they are using %n format for knowing it
because seq_*() returns 0 on success.

This patch introduces seq_setwidth() and seq_pad() for allowing them to
align without using %n format.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov cb900f4121 mm, hugetlb: convert hugetlbfs to use split pmd lock
Hugetlb supports multiple page sizes. We use split lock only for PMD
level, but not for PUD.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov bf929152e9 mm, thp: change pmd_trans_huge_lock() to return taken lock
With split ptlock it's important to know which lock
pmd_trans_huge_lock() took.  This patch adds one more parameter to the
function to return the lock.

In most places migration to new api is trivial.  Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov e1f56c89b0 mm: convert mm->nr_ptes to atomic_long_t
With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Linus Torvalds 3aeb58ab62 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update frm Chris Mason:
 "This is our usual merge window set of bug fixes, performance
  improvements and cleanups.  Miao Xie has some really nice
  optimizations for writeback.

  Josef also expanded our sanity checks quite a bit; these make up a big
  chunk of the new lines"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (98 commits)
  Btrfs: rename btrfs_start_all_delalloc_inodes
  Btrfs: don't wait for the completion of all the ordered extents
  Btrfs: don't wait for all the async delalloc when shrinking delalloc
  Btrfs: fix the confusion between delalloc bytes and metadata bytes
  Btrfs: pick up the code for the item number calculation in flush_space()
  Btrfs: wait for the ordered extent only when we want
  Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc()
  Btrfs: avoid unnecessary scrub workers allocation
  Btrfs: check file extent type before anything else
  btrfs: Remove useless variable in write_ctree_super()
  btrfs: Fix checkpatch.pl warning of spacing issues
  btrfs: Replace kmalloc with kmalloc_array
  btrfs: Enclose macros with complex values within parenthesis
  btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
  btrfs: Remove redundant local zero structure
  btrfs: Pack struct btrfs_device
  btrfs: Replace multiple atomic_inc() with atomic_add()
  btrfs: Add helper function for free_root_pointers()
  Btrfs: fix a crash when running balance and defrag concurrently
  Btrfs: do not run snapshot-aware defragment on error
  ...
2013-11-15 08:45:16 +09:00
Christoph Hellwig aea240f416 nfsd: export proper maximum file size to the client
I noticed that we export a way to high value for the maxfilesize
attribute when debugging a client issue.  The issue didn't turn
out to be related to it, but I think we should export it, so that
clients can limit what write sizes they accept before hitting
the server.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-11-14 15:18:47 -05:00
Tyler Hicks 2000f5beab eCryptfs: file->private_data is always valid
When accessing the lower_file pointer located in private_data of
eCryptfs files, there is no need to check to see if the private_data
pointer has been initialized to a non-NULL value. The file->private_data
and file->private_data->lower_file pointers are always initialized to
non-NULL values in ecryptfs_open().

This change quiets a Smatch warning:

  CHECK   /var/scm/kernel/linux/fs/ecryptfs/file.c
fs/ecryptfs/file.c:321 ecryptfs_unlocked_ioctl() error: potential NULL dereference 'lower_file'.
fs/ecryptfs/file.c:335 ecryptfs_compat_ioctl() error: potential NULL dereference 'lower_file'.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Geyslan G. Bem <geyslan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2013-11-14 12:12:15 -08:00
Linus Torvalds 4fbf888acc Ext4 updates for 3.13. Mostly bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCAAGBQJSgzHaAAoJENNvdpvBGATwPKkQAIydz5qBwyFyj3S2JieCe0L4
 kIqy4QNJtRCPtDKid3aHBjrFmY1QB0bwqFGC/m5zFBlLGxVoPpRB+w01pjZ0uLBl
 DYDdefsQyjmzEDkcawQsQGuvyXIMJPhD7EGKDgpzxOJUBSC9hytrnfGAue1zzFt1
 GVBtjpoJAfS32mxf8Kl/N0biFF3VasFkuvm8gtby2eXM06n5O5/yy/5k8Pv6X/3r
 ZhfYRq4nUZ2vUQx7Tw5x9Bn0IR5xMDnfj2Bs/gfPBo+GbdhJQzdPK1+l4e03Vp62
 sAO9J8Pz+hZs6e7QfZ+FwiBQY8uHVkgTJ/rPWfTEY3bGxvPahfKrFuqvPLh2flpu
 RVdO8Dxw9NuMqR96mpMxQua7XtnJBknytr82QfVzoYBPIkBMwKUzUKoYxob/FcL3
 ztdnnj0YygOan/E/unZn1mnIz6RyO1YjmL7S150kGtyUYWAI/pVHWufHqgY6gUqO
 4AQLLPEOXnF9+OITLGgnncWNunimPl0VWCCUyvOLiFK9n3WVFZibR7JDBKM4PzAb
 pYfgIlySofadALQefs8G1IV+k+3GuiobdmdVUN1GZ609julBi82DhQz31tHy32mO
 gE0YJVmorvWNO+ap3Sf4UJ6Bk5of29xtSx+cTBQ0CNhs+TmWjT25iTuLAvbvxWnh
 FUsRfzLrXy0SqZ6+gm5q
 =F4g3
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 changes from Ted Ts'o:
 "Ext4 updates for 3.13.  Mostly bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add prototypes for macro-generated functions
  ext4: return non-zero st_blocks for inline data
  ext4: use prandom_u32() instead of get_random_bytes()
  ext4: remove unreachable code after ext4_can_extents_be_merged()
  ext4: remove unreachable code in ext4_can_extents_be_merged()
  ext4: avoid bh leak in retry path of ext4_expand_extra_isize_ea()
  ext4: don't count free clusters from a corrupt block group
  ext4: fix FITRIM in no journal mode
  ext4: drop set but otherwise unused variable from ext4_add_dirent_to_inline()
  ext4: change ext4_read_inline_dir() to return 0 on success
  ext4: pair trace_ext4_writepages & trace_ext4_writepages_result
  ext4: add ratelimiting to ext4 messages
  ext4: fix performance regression in ext4_writepages
  ext4: fixup kerndoc annotation of mpage_map_and_submit_extent()
  ext4: fix assertion in ext4_add_complete_io()
2013-11-14 17:19:58 +09:00
Linus Torvalds 7e1a1e9378 xfs: update for v3.13-rc1
For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
 refactoring.  Bugfixes that stand out are the fix for the AGF/AGI
 deadlock, incore extent list fixes, verifier fixes for v4 superblocks
 and growfs, and memory leaks.  There are some asserts, warnings, and
 strings that were cleaned up.  There was further rearrangement of code
 to make libxfs and the kernel sync up more easily, differences between
 v2 and v3 directory code were abstracted using an ops vector,
 xfs_inactive was reworked, and the preallocation/hole punching code was
 refactored.
 
 - simplify kmem_zone_zalloc
 - add traces for AGF/AGI read ops
 - add additional AIL traces
 - fix xfs_remove AGF vs AGI deadlock
 - fix the extent count of new incore extent page in the indirection array
 - don't fail bad secondary superblocks verification on v4 filesystems
   due to unzeroed bits after v4 fields
 - fix possible NULL dereference in xlog_verify_iclog
 - remove redundant assert in xfs_dir2_leafn_split
 - prevent stack overflows from page cache allocation
 - fix some sparse warnings
 - fix directory block format verifier to check the leaf entry count
 - abstract the differences in dir2/dir3 via an ops vector
 - continue process of reorganization to make libxfs/kernel code merges easier
 - refactor the preallocation and hole punching code
 - fix for growfs and verifiers
 - remove unnecessary scary corruption error when probing non-xfs filesystems
 - remove extra newlines from strings passed to printk
 - prevent deadlock trying to cover an active log
 - rework xfs_inactive()
 - add the inode directory type support to XFS_IOC_FSGEOM
 - cleanup (remove) usage of is_bad_inode
 - fix miscalculation in xfs_iext_realloc_direct which results in oversized
   direct extent list
 - remove unnecessary count arg to xfs_iomap_write_allocate
 - fix memory leak in xlog_recover_add_to_trans
 - check superblock instead of block magic to determine if dtype field
   is present
 - fix lockdep annotation due to project quotas
 - fix regression in xfs_node_toosmall which can lead to incorrect directory
   btree node collapse
 - make log recovery verify filesystem uuid of recovering blocks
 - fix XFS_IOC_FREE_EOFBLOCKS definition
 - remove invalid assert in xfs_inode_free
 - fix for AIL lock regression
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJShBL7AAoJENaLyazVq6ZOaRwP/1B3QkWxFArRSD4wl15oBxpN
 Zv6D7woTmAvON87OIG4m67gyTr5/yNrPy8bg6Qw4YoL6lHTgle+RDUaKhgnsODoX
 Gd/oOiKCBqGfe93zs2fzIQzZ+yn+xdXr+q8uyEwEe8QHK6/wg6lEHNNae8VXEBlO
 20ec4b0U9dxoOJyG8nJNdytI++jp3TWzmZGpmLwisRogt4b86JM+QRhKFOe18AeI
 c9ky0uQmOQ6gX6h1VKN1L1u66GpTtFgj8XqPp/V6D8xHb1XGNiutDAKD7Mt/Rcgf
 njQsky2lXSQIuOnhyS1+lPvR8x19srs6UdnxcWdJvvwsICb14ZHEsZQA7M1bkLnw
 zNYtwn5RSneVSdjUZ+55dU1oDfTw2fxRHmKcm3bKrJCG7aOcH5vhEKs6HS0eVZAW
 4AcjThA43UpcEv47sghd7WJ+hFc4tKDVh9BOLUNi9zlkltVdP6WmWduMco0mRNeJ
 gd++CFRv9R3cQ0UUNsNMGQ9a8k/TW5uHYRsfX2IRBcgXQD2Ip1HBGLGSft2/JA4G
 U53mM08RntInGKctp1PjJea74QPJrYT7wBMlBl917tmnZ59i20nDs/OfeD2Dsnod
 9ekK5J7cMGHdWnQ3+o2b9Awuypcl+d9vdNKgNmNVTPlptfkI5OjJ5+BhqScyDw7m
 LJ1JmPIPIJF7vdIqBJWL
 =XMd/
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
  refactoring.  Bugfixes that stand out are the fix for the AGF/AGI
  deadlock, incore extent list fixes, verifier fixes for v4 superblocks
  and growfs, and memory leaks.  There are some asserts, warnings, and
  strings that were cleaned up.  There was further rearrangement of code
  to make libxfs and the kernel sync up more easily, differences between
  v2 and v3 directory code were abstracted using an ops vector,
  xfs_inactive was reworked, and the preallocation/hole punching code
  was refactored.

   - simplify kmem_zone_zalloc
   - add traces for AGF/AGI read ops
   - add additional AIL traces
   - fix xfs_remove AGF vs AGI deadlock
   - fix the extent count of new incore extent page in the indirection
     array
   - don't fail bad secondary superblocks verification on v4 filesystems
     due to unzeroed bits after v4 fields
   - fix possible NULL dereference in xlog_verify_iclog
   - remove redundant assert in xfs_dir2_leafn_split
   - prevent stack overflows from page cache allocation
   - fix some sparse warnings
   - fix directory block format verifier to check the leaf entry count
   - abstract the differences in dir2/dir3 via an ops vector
   - continue process of reorganization to make libxfs/kernel code
     merges easier
   - refactor the preallocation and hole punching code
   - fix for growfs and verifiers
   - remove unnecessary scary corruption error when probing non-xfs
     filesystems
   - remove extra newlines from strings passed to printk
   - prevent deadlock trying to cover an active log
   - rework xfs_inactive()
   - add the inode directory type support to XFS_IOC_FSGEOM
   - cleanup (remove) usage of is_bad_inode
   - fix miscalculation in xfs_iext_realloc_direct which results in
     oversized direct extent list
   - remove unnecessary count arg to xfs_iomap_write_allocate
   - fix memory leak in xlog_recover_add_to_trans
   - check superblock instead of block magic to determine if dtype field
     is present
   - fix lockdep annotation due to project quotas
   - fix regression in xfs_node_toosmall which can lead to incorrect
     directory btree node collapse
   - make log recovery verify filesystem uuid of recovering blocks
   - fix XFS_IOC_FREE_EOFBLOCKS definition
   - remove invalid assert in xfs_inode_free
   - fix for AIL lock regression"

* tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs: (49 commits)
  xfs: simplify kmem_{zone_}zalloc
  xfs: add tracepoints to AGF/AGI read operations
  xfs: trace AIL manipulations
  xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
  xfs: fix the extent count when allocating an new indirection array entry
  xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
  xfs: fix possible NULL dereference in xlog_verify_iclog
  xfs:xfs_dir2_node.c: pointer use before check for null
  xfs: prevent stack overflows from page cache allocation
  xfs: fix static and extern sparse warnings
  xfs: validity check the directory block leaf entry count
  xfs: make dir2 ftype offset pointers explicit
  xfs: convert directory vector functions to constants
  xfs: convert directory vector functions to constants
  xfs: vectorise encoding/decoding directory headers
  xfs: vectorise DA btree operations
  xfs: vectorise directory leaf operations
  xfs: vectorise directory data operations part 2
  xfs: vectorise directory data operations
  xfs: vectorise remaining shortform dir2 ops
  ...
2013-11-14 17:16:35 +09:00
Linus Torvalds 5e30025a31 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "The biggest changes:

   - add lockdep support for seqcount/seqlocks structures, this
     unearthed both bugs and required extra annotation.

   - move the various kernel locking primitives to the new
     kernel/locking/ directory"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  block: Use u64_stats_init() to initialize seqcounts
  locking/lockdep: Mark __lockdep_count_forward_deps() as static
  lockdep/proc: Fix lock-time avg computation
  locking/doc: Update references to kernel/mutex.c
  ipv6: Fix possible ipv6 seqlock deadlock
  cpuset: Fix potential deadlock w/ set_mems_allowed
  seqcount: Add lockdep functionality to seqcount/seqlock structures
  net: Explicitly initialize u64_stats_sync structures for lockdep
  locking: Move the percpu-rwsem code to kernel/locking/
  locking: Move the lglocks code to kernel/locking/
  locking: Move the rwsem code to kernel/locking/
  locking: Move the rtmutex code to kernel/locking/
  locking: Move the semaphore core to kernel/locking/
  locking: Move the spinlock code to kernel/locking/
  locking: Move the lockdep code to kernel/locking/
  locking: Move the mutex code to kernel/locking/
  hung_task debugging: Add tracepoint to report the hang
  x86/locking/kconfig: Update paravirt spinlock Kconfig description
  lockstat: Report avg wait and hold times
  lockdep, x86/alternatives: Drop ancient lockdep fixup message
  ...
2013-11-14 16:30:30 +09:00
Steve French 41c1358e91 CIFS: SMB2/SMB3 Copy offload support (refcopy) phase 1
This first patch adds the ability for us to do a server side copy
(ie fast copy offloaded to the server to perform, aka refcopy)

"cp --reflink"

of one file to another located on the same server.  This
is much faster than traditional copy (which requires
reading and writing over the network and extra
memcpys).

This first version is not going to be copy
files larger than about 1MB (to Samba) until I add
support for multiple chunks and for autoconfiguring
the chunksize.

It includes:
1) processing of the ioctl
2) marshalling and sending the SMB2/SMB3 fsctl over the network
3) simple parsing of the response

It does not include yet (these will be in followon patches to come soon):
1) support for multiple chunks
2) support for autoconfiguring and remembering the chunksize
3) Support for the older style copychunk which Samba 4.1 server supports
(because this requires write permission on the target file, which
cp does not give you, apparently per-posix).  This may require
a distinct tool (other than cp) and other ioctl to implement.

Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-11-14 00:05:36 -06:00
Linus Torvalds 82cb6acea4 MTD merge for 3.13
* Unify some compile-time differences so that we have fewer uses of
    #ifdef CONFIG_OF in atmel_nand
  * Other general cleanups (removing unused functions, options, variables,
    fields; use correct interfaces)
  * Fix BUG() for new odd-sized NAND, which report non-power-of-2 dimensions via
    ONFI
  * Miscellaneous driver fixes (SPI NOR flash; BCM47xx NAND flash; etc.)
  * Improve differentiation between SLC and MLC NAND -- this clarifies an ABI
    issue regarding the MTD "type" (in sysfs and in ioctl(MEMGETINFO)), where
    the MTD_MLCNANDFLASH type was present but inconsistently used
  * Extend GPMI NAND to support multi-chip-select NAND for some platforms
  * Many improvements to the OMAP2/3 NAND driver, including an expanded DT
    binding to bring us closer to mainline support for some OMAP systems
  * Fix a deadlock in the error path of the Atmel NAND driver probe
  * Correct the error codes from MTD mmap() to conform to POSIX and the Linux
    Programmer's Manual. This is an acknowledged change in the MTD ABI, but I
    can't imagine somebody relying on the non-standard -ENOSYS error code
    specifically. Am I just being unimaginative? :)
  * Fix a few important GPMI NAND bugs (one regression from 3.12 and one
    long-standing race condition)
  * More? Read the log!
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgzYRAAoJEFySrpd9RFgtv8EP/3ZIS1w4fHyWafVSdgVFGR0Y
 urlVDhg7iBauh9admN9xxBz6CYRwhjby8GnN87Q1qzu95Xp63RVx31nNfdBW3DGd
 92vSyskijYJcUtanBxYqGp1i3EbQcpF4mumqxnre3C4KTLNije41t/wNVqnXAstU
 DWho2iymZdkweKJ0DqBA7WF4l/YscdFyNDanO9JWiwII05Rh3Acv7FPMFm3Clblw
 Nvfwzgp4XycYMeIQtkmQgQ3GgeWtxPgQwqMofn97MVH4zeTsmUP317ohIMukLGJD
 db33J2xBdrIbk9P4D3RvjOCYyAyonu9y6/p+B1Vmj+R4CAUvQOIljhklHFoT3UZW
 OzUHPxB6T0+NZyQ/5IRQIYH9As++vdb/bzsUXm/cXceI4o4I0QCPy/8adifakBOF
 IUX9/BCdUOfKXvdOXY5dXMR2sY1IBg/1WfI+qcAoITsS/EVrUTrOcfSLyGqF0ERU
 c7mAzXiyp4D51x66/QnfJ4aJjlioQSoa3mK1j4fXqH08YB5Zclpz938Bo1AO3lWy
 /n+NYSbeXJoi4rVkNawjrRVs+0OTby2XQ5OqBlUMH6f30fqjUefPm66ZBMhbxzYu
 5QFDctUbnHCyAPpOtM/WR3/NOkIqVhQl1331A+dG2TzLK0vTHs+kbt/YmIITpjI+
 yn70XJGhk1F4gy8zhD+V
 =z5qO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20131112' of git://git.infradead.org/linux-mtd

Pull MTD changes from Brian Norris:
 - Unify some compile-time differences so that we have fewer uses of
   #ifdef CONFIG_OF in atmel_nand
 - Other general cleanups (removing unused functions, options,
   variables, fields; use correct interfaces)
 - Fix BUG() for new odd-sized NAND, which report non-power-of-2
   dimensions via ONFI
 - Miscellaneous driver fixes (SPI NOR flash; BCM47xx NAND flash; etc.)
 - Improve differentiation between SLC and MLC NAND -- this clarifies an
   ABI issue regarding the MTD "type" (in sysfs and in the MEMGETINFO
   ioctl), where the MTD_MLCNANDFLASH type was present but
   inconsistently used
 - Extend GPMI NAND to support multi-chip-select NAND for some platforms
 - Many improvements to the OMAP2/3 NAND driver, including an expanded
   DT binding to bring us closer to mainline support for some OMAP
   systems
 - Fix a deadlock in the error path of the Atmel NAND driver probe
 - Correct the error codes from MTD mmap() to conform to POSIX and the
   Linux Programmer's Manual.  This is an acknowledged change in the MTD
   ABI, but I can't imagine somebody relying on the non-standard -ENOSYS
   error code specifically.  Am I just being unimaginative? :)
 - Fix a few important GPMI NAND bugs (one regression from 3.12 and one
   long-standing race condition)
 - More? Read the log!

* tag 'for-linus-20131112' of git://git.infradead.org/linux-mtd: (98 commits)
  mtd: gpmi: fix the NULL pointer
  mtd: gpmi: fix kernel BUG due to racing DMA operations
  mtd: mtdchar: return expected errors on mmap() call
  mtd: gpmi: only scan two chips for imx6
  mtd: gpmi: Use devm_kzalloc()
  mtd: atmel_nand: fix bug driver will in a dead lock if no nand detected
  mtd: nand: use a local variable to simplify the nand_scan_tail
  mtd: nand: remove deprecated IRQF_DISABLED
  mtd: dataflash: Say if we find a device we don't support
  mtd: nand: omap: fix error return code in omap_nand_probe()
  mtd: nand_bbt: kill NAND_BBT_SCANALLPAGES
  mtd: m25p80: fixup device removal failure path
  mtd: mxc_nand: Include linux/of.h header
  mtd: remove duplicated include from mtdcore.c
  mtd: m25p80: add support for Macronix mx25l3255e
  mtd: nand: omap: remove selection of BCH ecc-scheme via KConfig
  mtd: nand: omap: updated devm_xx for all resource allocation and free calls
  mtd: nand: omap: use drivers/mtd/nand/nand_bch.c wrapper for BCH ECC instead of lib/bch.c
  mtd: nand: omap: clean-up ecc layout for BCH ecc schemes
  mtd: nand: omap2: clean-up BCHx_HW and BCHx_SW ECC configurations in device_probe
  ...
2013-11-14 12:31:43 +09:00
Linus Torvalds 0910c0bdf7 Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe:
 "This is the pull request for the core changes in the block layer for
  3.13.  It contains:

   - The new blk-mq request interface.

     This is a new and more scalable queueing model that marries the
     best part of the request based interface we currently have (which
     is fully featured, but scales poorly) and the bio based "interface"
     which the new drivers for high IOPS devices end up using because
     it's much faster than the request based one.

     The bio interface has no block layer support, since it taps into
     the stack much earlier.  This means that drivers end up having to
     implement a lot of functionality on their own, like tagging,
     timeout handling, requeue, etc.  The blk-mq interface provides all
     these.  Some drivers even provide a switch to select bio or rq and
     has code to handle both, since things like merging only works in
     the rq model and hence is faster for some workloads.  This is a
     huge mess.  Conversion of these drivers nets us a substantial code
     reduction.  Initial results on converting SCSI to this model even
     shows an 8x improvement on single queue devices.  So while the
     model was intended to work on the newer multiqueue devices, it has
     substantial improvements for "classic" hardware as well.  This code
     has gone through extensive testing and development, it's now ready
     to go.  A pull request is coming to convert virtio-blk to this
     model will be will be coming as well, with more drivers scheduled
     for 3.14 conversion.

   - Two blktrace fixes from Jan and Chen Gang.

   - A plug merge fix from Alireza Haghdoost.

   - Conversion of __get_cpu_var() from Christoph Lameter.

   - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

   - A fix for a race between request completion and the timeout
     handling from Jeff Moyer.  This is what caused the merge conflict
     with blk-mq/core, in case you are looking at that.

   - A dm stacking fix from Mike Snitzer.

   - A code consolidation fix and duplicated code removal from Kent
     Overstreet.

   - A handful of block bug fixes from Mikulas Patocka, fixing a loop
     crash and memory corruption on blk cg.

   - Elevator switch bug fix from Tomoki Sekiyama.

  A heads-up that I had to rebase this branch.  Initially the immutable
  bio_vecs had been queued up for inclusion, but a week later, it became
  clear that it wasn't fully cooked yet.  So the decision was made to
  pull this out and postpone it until 3.14.  It was a straight forward
  rebase, just pruning out the immutable series and the later fixes of
  problems with it.  The rest of the patches applied directly and no
  further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: Do not call sector_div() with a 64-bit divisor
  kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
  block: Consolidate duplicated bio_trim() implementations
  block: Use rw_copy_check_uvector()
  block: Enable sysfs nomerge control for I/O requests in the plug list
  block: properly stack underlying max_segment_size to DM device
  elevator: acquire q->sysfs_lock in elevator_change()
  elevator: Fix a race in elevator switching and md device initialization
  block: Replace __get_cpu_var uses
  bdi: test bdi_init failure
  block: fix a probe argument to blk_register_region
  loop: fix crash if blk_alloc_queue fails
  blk-core: Fix memory corruption if blkcg_init_queue fails
  block: fix race between request completion and timeout handling
  blktrace: Send BLK_TN_PROCESS events to all running traces
  blk-mq: don't disallow request merges for req->special being set
  blk-mq: mq plug list breakage
  blk-mq: fix for flush deadlock
  ...
2013-11-14 12:08:14 +09:00
Jeff Layton 6d769f1e14 nfs: don't retry detect_trunking with RPC_AUTH_UNIX more than once
Currently, when we try to mount and get back NFS4ERR_CLID_IN_USE or
NFS4ERR_WRONGSEC, we create a new rpc_clnt and then try the call again.
There is no guarantee that doing so will work however, so we can end up
retrying the call in an infinite loop.

Worse yet, we create the new client using rpc_clone_client_set_auth,
which creates the new client as a child of the old one. Thus, we can end
up with a *very* long lineage of rpc_clnts. When we go to put all of the
references to them, we can end up with a long call chain that can smash
the stack as each rpc_free_client() call can recurse back into itself.

This patch fixes this by simply ensuring that the SETCLIENTID call will
only be retried in this situation if the last attempt did not use
RPC_AUTH_UNIX.

Note too that with this change, we don't need the (i > 2) check in the
-EACCES case since we now have a more reliable test as to whether we
should reattempt.

Cc: stable@vger.kernel.org # v3.10+
Cc: Chuck Lever <chuck.lever@oracle.com>
Tested-by/Acked-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-13 19:21:05 -05:00
J. Bruce Fields 6ff40decff nfsd4: improve write performance with better sendspace reservations
Currently the rpc code conservatively refuses to accept rpc's from a
client if the sum of its worst-case estimates of the replies it owes
that client exceed the send buffer space.

Unfortunately our estimate of the worst-case reply for an NFSv4 compound
is always the maximum read size.  This can unnecessarily limit the
number of operations we handle concurrently, for example in the case
most operations are writes (which have small replies).

We can do a little better if we check which ops the compound contains.

This is still a rough estimate, we'll need to improve on it some day.

Reported-by: Shyam Kaushik <shyamnfs1@gmail.com>
Tested-by: Shyam Kaushik <shyamnfs1@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-11-13 16:12:54 -05:00
Al Viro ede4cebce1 prepend_path() needs to reinitialize dentry/vfsmount/mnt on restarts
... and equivalent is needed in 3.12; it's broken there as well

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-13 07:45:40 -05:00
Li Zhong 4ec6c2aeab fix unpaired rcu lock in prepend_path()
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-13 07:43:10 -05:00
Dan Carpenter 4fdb793ffe locks: missing unlock on error in generic_add_lease()
We should unlock here before returning.

Fixes: df4e8d2c1d ('locks: implement delegations')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-13 07:30:53 -05:00
Dan Carpenter 7f62656be8 aio: checking for NULL instead of IS_ERR
alloc_anon_inode() returns an ERR_PTR(), it doesn't return NULL.

Fixes: 71ad7490c1 ('rework aio migrate pages to use aio fs')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-13 07:30:53 -05:00
Linus Torvalds 5cbb3d216e Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 "Quite a lot of other stuff is banked up awaiting further
  next->mainline merging, but this batch contains:

   - Lots of random misc patches
   - OCFS2
   - Most of MM
   - backlight updates
   - lib/ updates
   - printk updates
   - checkpatch updates
   - epoll tweaking
   - rtc updates
   - hfs
   - hfsplus
   - documentation
   - procfs
   - update gcov to gcc-4.7 format
   - IPC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
  ipc, msg: fix message length check for negative values
  ipc/util.c: remove unnecessary work pending test
  devpts: plug the memory leak in kill_sb
  ./Makefile: export initial ramdisk compression config option
  init/Kconfig: add option to disable kernel compression
  drivers: w1: make w1_slave::flags long to avoid memory corruption
  drivers/w1/masters/ds1wm.cuse dev_get_platdata()
  drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
  drivers/memstick/core/mspro_block.c: fix attributes array allocation
  drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
  kernel/panic.c: reduce 1 byte usage for print tainted buffer
  gcov: reuse kbasename helper
  kernel/gcov/fs.c: use pr_warn()
  kernel/module.c: use pr_foo()
  gcov: compile specific gcov implementation based on gcc version
  gcov: add support for gcc 4.7 gcov format
  gcov: move gcov structs definitions to a gcc version specific file
  kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
  kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
  kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
  ...
2013-11-13 15:45:43 +09:00
Linus Torvalds 9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Linus Torvalds f023029427 dlm for 3.13
This set includes a single fix to resolve to a race that could cause
 lockspace shutdown to incorrectly return -EBUSY.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgUirAAoJEDgbc8f8gGmqbyoQAK4mO5dAXYdZwqjxJ3qaE2IS
 UyBJDNeRMyxfKaCGl8B4vM8p35mR9JaYMGFmwgV5a4nkhUia+/2uNqSkd8zk35nq
 xglt94bQ8dpYvUXpQvhgORvG+j+AQvOY5YbYPCzTwUBSKLALdOS6cSKF8rC0EM2q
 JCfCCvXXEshfCSD0jvQKoXZrSkE+vOwJH9oqB8dzQYa6JT1fQidXGftey75AAtEc
 Gkx24glMhA+ovRP+TMjD4K/gwRlKXuUuCnPYnSWfuT3XHlGB+hhf81g1P6731DxB
 gVMoaSybaV8SS94scymQNVHkU6+7yRTMeTj3s7V/o4ZgAT7t1VvPEeXQC5AMdrnI
 BTO/S1SUJ+ySHsPy6Lr8DfRMtJww8ZLCptj30zGkzBIVkcSzr1y6/0kFWU3V7w7G
 mH0NyPrrH9HNwXaOELGKiQ37g+UTVnGQtdFXsZZzQ3iu0/2mKaA9x9bCahtt27YX
 Y7T88AaXQD0na3ubk/IwN3mEXzGRUUdUVyVYLcCBYNMCtncHZmtZAnAaoI6aaArc
 wleNIyMJmqeCufXMHf9TeT9RSIC49wMdU5A/2SB/MZ6H6xcZ0jr5yVKykLzJxjNo
 VwieufyvqFQ1sGr3nOUPYyRzBvtFvrjqT349TW8lo15z/zDb4xhA+SGVF1Mesuz7
 Y2X1sJ6CmVd53QkmT+LA
 =ta9e
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm fix from David Teigland:
 "This set includes a single fix to resolve to a race that could cause
  lockspace shutdown to incorrectly return -EBUSY"

* tag 'dlm-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: Avoid that dlm_release_lockspace() incorrectly returns -EBUSY
2013-11-13 15:31:13 +09:00
Linus Torvalds fbe43ff003 Mostly fixes for the power cut emulation UBIFS mode, and only one functional
change which fixes a return error code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgh3wAAoJECmIfjd9wqK0dxwQALF6rMiFldVa9uZnNgOawzA9
 yB8OR2EmQilZzCcSCLKaPIwzc7izJO7H6hbaLZrNY9UHzjoCwLhTGkbdkuGwkKuF
 Q+M99MKGnYghNWsOWu50J7wewR9zJPD1UGy3ZU/CJXfQzK4uwHIEOiXss+b0kZCo
 sW5ogFOiFzEkz8/CtGwffPXNWuqwPXbDBVHXBjnmv9MCCzsSRdGEpNClCOI/qUHQ
 pFZhKVHSiqYDPvvwQ/pTGv6kw2tu2pVh6+O232g35JB9G3NHq7MDxg0RLr61C7qK
 3ZK6K34r6D09NKhjrOOAsf/EgzyNXtcIL9ySYamaZzEsMGBb2LNMneu3gkrk7O49
 5CXi5Em4cg8eBzhEEmHDxCZanwPhggnuLh+YPlaBBNU/z3NiYRNYoHRjzQWCFwa/
 yee6AzU+w1i0NRtjpPIAmpaqkryJnAZ+c2Hthlm+DPtN3DzWX13QxHdWTxC3Xycc
 8aZNqC128meDqrKRxKkhh1H5TLRdsVTkC9yf0Iv8U8nQg4aT/dPJedNOI4n20Oqy
 C8LnABH6wtc7MImzA5xRbk7CHI0Feb8h6darSKWSLydKNJmvyJ6nRdsi5tnfBEyn
 kbLdYed833tCll86G8iYMwt3jO0DURn+yNlQrTsf3kvNTADvJdzAoP3tudgDxdIA
 T7Uxs/2VFIFAlDK+U5Aw
 =ap2l
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.13-rc1' of git://git.infradead.org/linux-ubifs

Pull ubifs changes from Artem Bityutskiy:
 "Mostly fixes for the power cut emulation UBIFS mode, and only one
  functional change which fixes a return error code"

* tag 'upstream-3.13-rc1' of git://git.infradead.org/linux-ubifs:
  UBIFS: correct data corruption range
  UBIFS: fix return code
  UBIFS: remove unnecessary code in ubifs_garbage_collect
2013-11-13 15:28:45 +09:00
Linus Torvalds a7fa20a594 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This adds a ->writepage() implementation to fuse, improving mmaped
  writeout and paving the way for buffered writeback.

  And there's a patch to add a fix minor number for /dev/cuse, similarly
  to /dev/fuse"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: writepages: protect secondary requests from fuse file release
  fuse: writepages: update bdi writeout when deleting secondary request
  fuse: writepages: crop secondary requests
  fuse: writepages: roll back changes if request not found
  cuse: add fix minor number to /dev/cuse
  fuse: writepage: skip already in flight
  fuse: writepages: handle same page rewrites
  fuse: writepages: fix aggregation
  fuse: fix race in fuse_writepages()
  fuse: Implement writepages callback
  fuse: don't BUG on no write file
  fuse: lock page in mkwrite
  fuse: Prepare to handle multiple pages in writeback
  fuse: Getting file for writeback helper
2013-11-13 15:27:00 +09:00
Linus Torvalds a30124539b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext[23], udf and quota fixes from Jan Kara:
 "Assorted fixes in quota, ext2, ext3 & udf.

  Probably the most important is a fix of fs corruption issue in ext2
  XIP support (OTOH xip is rarely used)"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix fs corruption in ext2_get_xip_mem()
  quota: info leak in quota_getquota()
  jbd: Revert "jbd: remove dependency on __GFP_NOFAIL"
  udf: fix for pathetic mount times in case of invalid file system
  ext3: Count journal as bsddf overhead in ext3_statfs
2013-11-13 15:25:47 +09:00
Linus Torvalds dd1d1399f2 f2fs updates for v3.13
This patch-set includes the following major enhancement patches.
 o add a sysfs to control reclaiming free segments
 o enhance the f2fs global lock procedures
 o enhance the victim selection flow
 o wait for selected node blocks during fsync
 o add some tracepoints
 o add a config to remove abundant BUG_ONs
 
 The other bug fixes are as follows.
 o fix deadlock on acl operations
 o fix some bugs with respect to orphan inodes
 
 And, there are a bunch of cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSgMZ/AAoJEEAUqH6CSFDSsL4P/Ri6GZyy5F0DGjAJElX825gO
 gthRZ1uq1OAXUYDOEy150CsFgIiWeu2MxiOV15UnmX893a5cXrf32afoa/Cqx8GG
 FVEYc5+dDdogezQCW6XSatQ4s7cDQDymyT2Mky4MJyAxhpYtvbpcyVI/OWdVcTwh
 pqdJfsfuOikbOOL6VU2B5dDKwjc+6lgntdv/eICzNCH9NqHv8kxmm+h3NfaqUVrW
 pK1irqsXrktcwLIrOH0c5ZpPcKPghJuw37oFpEw8MxYbTnpdrbLq4BKE/fRh8Fhf
 R+sQgEoWZriE1SISHmYjWdt87hnFCk3wysl61Z/zkOxnYKebRBrjEiudzxAHDIGY
 +I71ovpVCWe0uljdiTBpLQ/iN4p2fRMLjn7j1IsMzoG9yfVFduMaY70m1AOZI/7z
 03QRpkmiRi7F8GYTSlPefsUUAnMYVDO6DzsyfHdxa8v+4UvWhSE4380L9DttNbCr
 2/+NGRZ4kga6GSsMhdn2Bnm6i3TkMDJosu4USkv4qGR1SH1+S5dodwxfQdonPUZg
 380kPkV7/gBYaMBSdrQFds3lh7g431gfYEfGSWt3vA14fFIWP7nIFpVIPGMM6/Sd
 GFe6gqZ2JLatqJnQNwEjPsBPPsiCAt6exbg86fTCvrS+oyQTiv44FNOWbz7iTrxw
 5nZQfQHSMhKtux7rpM/N
 =Grs1
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches.
   - add a sysfs to control reclaiming free segments
   - enhance the f2fs global lock procedures
   - enhance the victim selection flow
   - wait for selected node blocks during fsync
   - add some tracepoints
   - add a config to remove abundant BUG_ONs

  The other bug fixes are as follows.
   - fix deadlock on acl operations
   - fix some bugs with respect to orphan inodes

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits)
  f2fs: issue more large discard command
  f2fs: fix memory leak after kobject init failed in fill_super
  f2fs: cleanup waiting routine for writeback pages in cp
  f2fs: avoid to use a NULL point in destroy_segment_manager
  f2fs: remove unnecessary TestClearPageError when wait pages writeback
  f2fs: update f2fs document
  f2fs: avoid to wait all the node blocks during fsync
  f2fs: check all ones or zeros bitmap with bitops for better mount performance
  f2fs: change the method of calculating the number summary blocks
  f2fs: fix calculating incorrect free size when update xattr in __f2fs_setxattr
  f2fs: add an option to avoid unnecessary BUG_ONs
  f2fs: introduce CONFIG_F2FS_CHECK_FS for BUG_ON control
  f2fs: fix a deadlock during init_acl procedure
  f2fs: clean up acl flow for better readability
  f2fs: remove unnecessary segment bitmap updates
  f2fs: add tracepoint for vm_page_mkwrite
  f2fs: add tracepoint for set_page_dirty
  f2fs: remove redundant set_page_dirty from write_compacted_summaries
  f2fs: add reclaiming control by sysfs
  f2fs: introduce f2fs_balance_fs_bg for some background jobs
  ...
2013-11-13 15:24:40 +09:00
Ilija Hadzic 66da0e1f90 devpts: plug the memory leak in kill_sb
When devpts is unmounted, there may be a no-longer-used IDR tree hanging
off the superblock we are about to kill.  This needs to be cleaned up
before destroying the SB.

The leak is usually not a big deal because unmounting devpts is typically
done when shutting down the whole machine.  However, shutting down an LXC
container instead of a physical machine exposes the problem (the garbage
is detectable with kmemleak).

Signed-off-by: Ilija Hadzic <ihadzic@research.bell-labs.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:36 +09:00
Kees Cook d049f74f2d exec/ptrace: fix get_dumpable() incorrect tests
The get_dumpable() return value is not boolean.  Most users of the
function actually want to be testing for non-SUID_DUMP_USER(1) rather than
SUID_DUMP_DISABLE(0).  The SUID_DUMP_ROOT(2) is also considered a
protected state.  Almost all places did this correctly, excepting the two
places fixed in this patch.

Wrong logic:
    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
        or
    if (dumpable == 0) { /* be protective */ }
        or
    if (!dumpable) { /* be protective */ }

Correct logic:
    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
        or
    if (dumpable != 1) { /* be protective */ }

Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
user was able to ptrace attach to processes that had dropped privileges to
that user.  (This may have been partially mitigated if Yama was enabled.)

The macros have been moved into the file that declares get/set_dumpable(),
which means things like the ia64 code can see them too.

CVE-2013-2929

Reported-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:33 +09:00
Randy Dunlap 1c3fc3e5cc kcore: add Kconfig help text
Under Pseudo filesystems, /proc/kcore support has no help.

Fixes a portion of kernel bugzilla #52671:
  https://bugzilla.kernel.org/show_bug.cgi?id=52671

Thanks for David Howells for the help text.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: <lailavrazda1979@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:33 +09:00
HATAYAMA Daisuke 5721cf84d9 procfs: clean up proc_reg_get_unmapped_area for 80-column limit
Clean up proc_reg_get_unmapped_area due to its 80-column limit
violation.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:33 +09:00
Vyacheslav Dubeyko 95e0d7dbb9 hfsplus: implement attributes file creation functionality
Implement functionality of creation AttributesFile metadata file on HFS+
volume in the case of absence of it.

It makes trying to open AttributesFile's B-tree during mount of HFS+
volume.  If HFS+ volume hasn't AttributesFile then a pointer on
AttributesFile's B-tree keeps as NULL.  Thereby, when it is discovered
absence of AttributesFile on HFS+ volume in the begin of xattr creation
operation then AttributesFile will be created.

The creation of AttributesFile will have success in the case of
availability (2 * clump) free blocks on HFS+ volume.  Otherwise,
creation operation is ended with error (-ENOSPC).

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:32 +09:00
Vyacheslav Dubeyko 099e9245e0 hfsplus: implement attributes file's header node initialization code
Implement functionality of AttributesFile's header node initialization.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:32 +09:00
Vyacheslav Dubeyko b3b5b0f03c hfsplus: add metadata file's clump size calculation functionality
There are situation when HFS+ volume had been created without
AttributesFile.  Such situation can take place because of using old
mkfs.hfs utility or creation HFS+ volume without taking in mind
necessity to use xattrs.  For example, Mac OS X 10.4 (Tiger) doesn't
create AttributesFile during mkfs phase.  Also it is a very frequent
situation for the case of users that created HFS+ volumes under Linux.
As a result, xattrs and POSIX ACLs on HFS+ volume are unavailable for
such users.

This patchset implements functionality of AttributesFile creation on
HFS+ volume in the case of this metadata file absence during operation
of xattr creation.

This patch:

Add functionality of metadata file's clump size calculation.  Operation
of AttributesFile creation needs in clump size setting.  This value will
be used when AttributesFile will be extended.

This code is adopted from code of newfs_hfs utility of diskdev_cmds packet
http://opensource.apple.com/tarballs/diskdev_cmds/.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:32 +09:00
Michael Opdenacker 74a797d99a fs/hfs/btree.h: remove duplicate defines
This patch removes duplicate defines from fs/hfs/btree.h

[akpm@linux-foundation.org: retain the comments]
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:32 +09:00
Jason Baron 67347fe4e6 epoll: do not take global 'epmutex' for simple topologies
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested.  The purpose of taking the
'epmutex' on add is to prevent complex topologies such as loops and deep
wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations.  However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we do not need to
hold the 'epmutex'.

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
performance:

"On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
addition the benchmark when from scaling well on 10 sockets to scaling
well on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck."

[akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:25 +09:00
Jason Baron ae10b2b4eb epoll: optimize EPOLL_CTL_DEL using rcu
Nathan Zimmer found that once we get over 10+ cpus, the scalability of
SPECjbb falls over due to the contention on the global 'epmutex', which is
taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
by using rcu to guard against any concurrent traversals.

Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
simple topologies.  IE when adding a link from an epoll file descriptor to
a wakeup source, where the epoll file descriptor is not nested.

This patch (of 2):

Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one.  In this way, we can
traverse the epoll network on the add path in parallel with deletes.
Since deletes can't create loops or worse wakeup paths, this is safe.

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:25 +09:00
Oleg Nesterov 6bc080d8fd debugfs: use list_next_entry() in debugfs_remove_recursive()
Change debugfs_remove_recursive() to use list_next_entry(child), no
changes in generated code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:24 +09:00
Michael Opdenacker 54886a7153 cramfs: mark as obsolete
Who needs cramfs when you have squashfs?  At least, we should warn people
that cramfs is obsolete.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:12 +09:00
Jerome Marchand 00619bcc44 mm: factor commit limit calculation
The same calculation is currently done in three differents places.
Factor that code so future changes has to be made at only one place.

[akpm@linux-foundation.org: uninline vm_commit_limit()]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Jan Kara c4a391b53a writeback: do not sync data dirtied after sync start
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
(e.g.  causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.

The following script reproduces the problem:

  function run_writers
  {
    for (( i = 0; i < 10; i++ )); do
      mkdir $1/dir$i
      for (( j = 0; j < 40000; j++ )); do
        dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
      done &
    done
  }

  for dir in "$@"; do
    run_writers $dir
  done

  sleep 40
  time sync

Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.

To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426.  With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Naoya Horiguchi ec8e41aec1 /proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line
This flag shows that the VMA is "newly created" and thus represents
"dirty" in the task's VM.

You can clear it by "echo 4 > /proc/pid/clear_refs."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
David Rientjes 948927ee9e mm, mempolicy: make mpol_to_str robust and always succeed
mpol_to_str() should not fail.  Currently, it either fails because the
string buffer is too small or because a string hasn't been defined for a
mempolicy mode.

If a new mempolicy mode is introduced and no string is defined for it,
just warn and return "unknown".

If the buffer is too small, just truncate the string and return, the
same behavior as snprintf().

This also fixes a bug where there was no NULL-byte termination when doing
*p++ = '=' and *p++ ':' and maxlen has been reached.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Chen Gang <gang.chen@asianux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:05 +09:00
Xishi Qiu 83285c72e0 mm: use pgdat_end_pfn() to simplify the code in others
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages".  Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:03 +09:00
Jan Kara 41ecc34598 ocfs2: simplify ocfs2_invalidatepage() and ocfs2_releasepage()
Ocfs2 doesn't do data journalling.  Thus its ->invalidatepage and
->releasepage functions never get called on buffers that have journal
heads attached.  So just use standard variants of functions from
buffer.c.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:02 +09:00
Joe Perches d00d2f8ab9 ocfs2: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:02 +09:00
Xue jiufei b1214e4757 ocfs2: fix possible double free in ocfs2_write_begin_nolock
When ocfs2_write_cluster_by_desc() failed in ocfs2_write_begin_nolock()
because of ENOSPC, it goes to out_quota, freeing data_ac(meta_ac).  Then
it calls ocfs2_try_to_free_truncate_log() to free space.  If enough
space freed, it will try to write again.  Unfortunately, some error
happenes before ocfs2_lock_allocators(), it goes to out and free
data_ac(meta_ac) again.

Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:02 +09:00
Younger Liu bfbca926d6 ocfs2: add missing errno in ocfs2_ioctl_move_extents()
If the file is not regular or writeable, it should return errno(EPERM).

This patch is based on 85a258b70d ("ocfs2: fix error handling in
ocfs2_ioctl_move_extents()").

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:02 +09:00
Younger Liu 8abaae8d85 ocfs2: do not call brelse() if group_bh is not initialized in ocfs2_group_add()
If group_bh is not initialized, there is no need to release.  This
problem does not cause anything wrong, but the patch would make the code
more logical.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Younger Liu eedd40e1ca ocfs2: rollback transaction in ocfs2_group_add()
If ocfs2_journal_access_di() fails, group->bg_next_group should rollback.
Otherwise, there would be a inconsistency between group_bh and main_bm_bh.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Junxiao Bi 728b98059a ocfs2: break useless while loop
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Akinobu Mita 518df6bdf3 ocfs2: use find_last_bit()
We already have find_last_bit().  So just use it as described in the
comment.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Xue jiufei fae477b6f0 ocfs2: delay migration when the lockres is in migration state
We trigger a bug in __dlm_lockres_reserve_ast() when we parallel umount 4
nodes.  The situation is as follows:

1) Node A migrate all lockres it owned(eg.  lockres A) to other nodes
   say node B when it umounts.

2) Receiving MIG_LOCKRES message from A, Node B masters the lockres A
   with DLM_LOCK_RES_MIGRATING state set.

3) Then we umount ocfs2 on node B.  It also should migrate lockres A to
   another node, say node C.  But now, DLM_LOCK_RES_MIGRATING state of
   lockers A is not cleared.  Node B triggered the BUG on lockres with
   state DLM_LOCK_RES_MIGRATING.

Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Xue jiufei 750e3c6581 ocfs2: skip locks in the blocked list
A parallel umount on 4 nodes triggered a bug in
dlm_process_recovery_date().  Here's the situation:

Receiving MIG_LOCKRES message, A node processes the locks in migratable
lockres.  It copys lvb from migratable lockres when processing the first
valid lock.

If there is a lock in the blocked list with the EX level, it triggers the
BUG.  Since valid lvbs are set when locks are granted with EX or PR
levels, locks in the blocked list cannot have valid lvbs.  Therefore I
think we should skip the locks in the blocked list.

Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Akinobu Mita a8f70de37b ocfs2: use bitmap_weight()
Use bitmap_weight() instead of reinventing the wheel.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Joel Becker 910bffe086 ocfs2: don't spam on -EDQUOT
-EDQUOT is a user-visible error, not a logic problem.  Teach mlog_errno()
to ignore it like it ignores -ENOSPC, etc.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Marek Królikowski <admin@wset.edu.pl>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:01 +09:00
Rui Xiang 58796207cf ocfs2: add necessary check in case sb_getblk() fails
sb_getblk() may return an err, so add a check for bh.

[joseph.qi@huawei.com: also add a check after calling sb_getblk() in ocfs2_create_xattr_block()]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:00 +09:00
Rui Xiang 7391a294b8 ocfs2: return ENOMEM when sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So return ENOMEM instead when it fails.

[joseph.qi@huawei.com: ocfs2_symlink_get_block() and ocfs2_read_blocks_sync() and ocfs2_read_blocks() need the same change]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:00 +09:00
Junxiao Bi f0cb0f0bca fs/ocfs2/file.c: fix wrong comment
Unwritten extent only exists for file systems which support holes.  But
the comment said was opposite meaning and also the comment is not very
clear, so rephase it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:00 +09:00
Goldwyn Rodrigues 06f9da6e82 fs/ocfs2: remove unnecessary variable bits_wanted from ocfs2_calc_extend_credits
Code cleanup to remove unnecessary variable passed but never used
to ocfs2_calc_extend_credits.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:00 +09:00
Linus Torvalds 10d0c9705e DeviceTree updates for 3.13. This is a bit larger pull request than
usual for this cycle with lots of clean-up.
 
 - Cross arch clean-up and consolidation of early DT scanning code.
 - Clean-up and removal of arch prom.h headers. Makes arch specific
   prom.h optional on all but Sparc.
 - Addition of interrupts-extended property for devices connected to
   multiple interrupt controllers.
 - Refactoring of DT interrupt parsing code in preparation for deferred
   probe of interrupts.
 - ARM cpu and cpu topology bindings documentation.
 - Various DT vendor binding documentation updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgPQ4AAoJEMhvYp4jgsXif28H/1WkrXq5+lCFQZF8nbYdE2h0
 R8PsfiJJmAl6/wFgQTsRel+ScMk2hiP08uTyqf2RLnB1v87gCF7MKVaLOdONfUDi
 huXbcQGWCmZv0tbBIklxJe3+X3FIJch4gnyUvPudD1m8a0R0LxWXH/NhdTSFyB20
 PNjhN/IzoN40X1PSAhfB5ndWnoxXBoehV/IVHVDU42vkPVbVTyGAw5qJzHW8CLyN
 2oGTOalOO4ffQ7dIkBEQfj0mrgGcODToPdDvUQyyGZjYK2FY2sGrjyquir6SDcNa
 Q4gwatHTu0ygXpyphjtQf5tc3ZCejJ/F0s3olOAS1ahKGfe01fehtwPRROQnCK8=
 =GCbY
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree updates from Rob Herring:
 "DeviceTree updates for 3.13.  This is a bit larger pull request than
  usual for this cycle with lots of clean-up.

   - Cross arch clean-up and consolidation of early DT scanning code.
   - Clean-up and removal of arch prom.h headers.  Makes arch specific
     prom.h optional on all but Sparc.
   - Addition of interrupts-extended property for devices connected to
     multiple interrupt controllers.
   - Refactoring of DT interrupt parsing code in preparation for
     deferred probe of interrupts.
   - ARM cpu and cpu topology bindings documentation.
   - Various DT vendor binding documentation updates"

* tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (82 commits)
  powerpc: add missing explicit OF includes for ppc
  dt/irq: add empty of_irq_count for !OF_IRQ
  dt: disable self-tests for !OF_IRQ
  of: irq: Fix interrupt-map entry matching
  MIPS: Netlogic: replace early_init_devtree() call
  of: Add Panasonic Corporation vendor prefix
  of: Add Chunghwa Picture Tubes Ltd. vendor prefix
  of: Add AU Optronics Corporation vendor prefix
  of/irq: Fix potential buffer overflow
  of/irq: Fix bug in interrupt parsing refactor.
  of: set dma_mask to point to coherent_dma_mask
  of: add vendor prefix for PHYTEC Messtechnik GmbH
  DT: sort vendor-prefixes.txt
  of: Add vendor prefix for Cadence
  of: Add empty for_each_available_child_of_node() macro definition
  arm/versatile: Fix versatile irq specifications.
  of/irq: create interrupts-extended property
  microblaze/pci: Drop PowerPC-ism from irq parsing
  of/irq: Create of_irq_parse_and_map_pci() to consolidate arch code.
  of/irq: Use irq_of_parse_and_map()
  ...
2013-11-12 16:52:17 +09:00
Linus Torvalds 4b4d2b4634 H8/300 has been dead for several years, the kernel for it has
not compiled for ages, and recent versions of gcc for it are broken.
 Remove support for it.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSev31AAoJEMsfJm/On5mBzSAQAKRBYLqtf3nJGm9pXGDhZPGG
 7KSQ8S11pg/wnXYW6P/XJhFRBrYkOOCeqVKQHtmxG8MmXQkkOz95rsIvBbUzU/FT
 yJAPKpOHdh1yLhBGgCj3WhGtjVwpbut1/y9n2M5SpGautUgxfLj9fJiswSJx0n7t
 VRWKwfIpBFPLPs9w6hdDf94tIXhSX8Me2gd3LDCPBEQ2SZYd8rtBasYtDeC2+FLa
 Xow4ZQrCU7hpYscSUFzJpok35hl7weGhJ9jjXwtic4byFHvdiyHUwCOaEWC0hqNi
 fOLWFbvBogqjyAktfZhfyL9R9/7lGlLshLQNmJWR3bO+nCJ21h9ATw0R4gLBdT4/
 lzLRnJ/4GdtbvmdqRxNjxxR4zHkZ+tE8HmaCmUzvqGfQyA5sJNBRrBDcWLUOVlO9
 0iIZsJBZjSQXKXSk9P5xH4G0tlbAFEUnEHKsrt/mgsD9Z3SgbPKAIWSBAJA0AMQk
 DXZaXrBRilXOPUCZASZfmK8AQFC1GYB0tz7nT4x1mjT2/JClgG2kHCAGhNmI+CbK
 l9VRIgBydppLFPOGhZLSNGQp29xBhw9JgOVns4a1k7kJQEw9ht38h8Q2ckRYxXhP
 /z53eZKMQk62quWlyLRgR9mWqZc2CIifLVdFjiOELMh7wKPwL6eGrrrGBDbPtctS
 PX5K26geb0oA3ZMjpBLr
 =V6n6
 -----END PGP SIGNATURE-----

Merge tag 'h8300-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull h8300 platform removal from Guenter Roeck:
 "The patch series has been in -next for more than one relase cycle.  I
  did get a number of Acks, and no objections.

  H8/300 has been dead for several years, the kernel for it has not
  compiled for ages, and recent versions of gcc for it are broken.
  Remove support for it"

* tag 'h8300-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  CREDITS: Add Yoshinori Sato for h8300
  fs/minix: Drop dependency on H8300
  Drop remaining references to H8/300 architecture
  Drop MAINTAINERS entry for H8/300
  watchdog: Drop references to H8300 architecture
  net/ethernet: Drop H8/300 Ethernet driver
  net/ethernet: smsc9194: Drop conditional code for H8/300
  ide: Drop H8/300 driver
  Drop support for Renesas H8/300 (h8300) architecture
2013-11-12 14:13:14 +09:00
Andreas Dilger 3f61c0cc70 ext4: add prototypes for macro-generated functions
It isn't very easy to find the declarations for the functions created
by EXT4_INODE_BIT_FNS() because the names are generated by macros:

    ext4_test_inode_flag, ext4_set_inode_flag, ext4_clear_inode_flag
    ext4_test_inode_state, ext4_set_inode_state, ext4_clear_inode_state

Add explicit declarations for these functions so that grep and tags
can find them.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-11 22:40:40 -05:00
Andreas Dilger 9206c56155 ext4: return non-zero st_blocks for inline data
Return a non-zero st_blocks to userspace for statfs() and friends.
Some versions of tar will assume that files with st_blocks == 0
do not contain any data and will skip reading them entirely.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-11 22:38:12 -05:00
Miao Xie 91aef86f3b Btrfs: rename btrfs_start_all_delalloc_inodes
rename the function -- btrfs_start_all_delalloc_inodes(), and make its
name be compatible to btrfs_wait_ordered_roots(), since they are always
used at the same place.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:58 -05:00
Miao Xie b02441999e Btrfs: don't wait for the completion of all the ordered extents
It is very likely that there are lots of ordered extents in the filesytem,
if we wait for the completion of all of them when we want to reclaim some
space for the metadata space reservation, we would be blocked for a long
time. The performance would drop down suddenly for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:44 -05:00
Miao Xie 9f3a074d10 Btrfs: don't wait for all the async delalloc when shrinking delalloc
It was very likely that there were lots of async delalloc pages in the
filesystem, if we waited until all the pages were flushed, we would be
blocked for a long time, and the performance would also drop down.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:37 -05:00
Miao Xie c61a16a701 Btrfs: fix the confusion between delalloc bytes and metadata bytes
In shrink_delalloc(), what we need reclaim is the metadata space, so
flushing pages by to_reclaim is not reasonable, it is very likely that
the pages we flush are not enough. And then we had to invoke the flush
function for several times, at the worst, we need call flush_space for
several times. It wasted time.

We improve this problem by converting the metadata space size we need
reserve to the delalloc bytes, By this way, we can flush the pages
by a reasonable number.

(Now we use a fixed number to do conversion, it is not flexible, maybe
 we can find a good way to improve it in the future.)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:30 -05:00
Miao Xie 18cd8ea6df Btrfs: pick up the code for the item number calculation in flush_space()
This patch picked up the code that was used to calculate the number of
the items for which we need reserve space, and we will use it in the next
patch.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:23 -05:00
Miao Xie 38c135af8e Btrfs: wait for the ordered extent only when we want
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:15 -05:00
Miao Xie d3ee29e396 Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:07 -05:00
Wang Shilong 3b7a016f44 Btrfs: avoid unnecessary scrub workers allocation
We only allocate scrub workers if we pass all the necessary
checks, for example, there are no operation in progress.

Besides, move mutex lock protection outside of scrub_workers_get()
/scrub_workers_put().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:58 -05:00
Josef Bacik 007d31f755 Btrfs: check file extent type before anything else
I hit this problem with my no holes patch and it made me realize what the
problem was for bz 60834.  If the first item in the leaf is an inline extent and
we try to read anything starting from disk_bytenr onward we will read off the
end of the leaf.  So we need to check to see what it's type is, and if it's not
REG we can just break out.  This should fix this problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:49 -05:00
Rashika f570e757b5 btrfs: Remove useless variable in write_ctree_super()
The function write_ctree_super() in disk-io.c uses variable ret to return
the result of function write_all_supers(). Since, this variable serves
no purpose, hence the patch removes it and returns the call of the
called function.

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:40 -05:00
Dulshani Gunawardhana 678712545b btrfs: Fix checkpatch.pl warning of spacing issues
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:31 -05:00
Dulshani Gunawardhana d9b0d9ba04 btrfs: Replace kmalloc with kmalloc_array
Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making
it easier to check is that the calculation doesn't wrap or return a smaller allocation

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:22 -05:00
Dulshani Gunawardhana e248e04e77 btrfs: Enclose macros with complex values within parenthesis
Enclose macros with complex values within parenthesis in accordance to
checkpatch.pl.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:06 -05:00
Dulshani Gunawardhana fae7f21cec btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:53 -05:00
Dulshani Gunawardhana b19e684393 btrfs: Remove redundant local zero structure
Remove redundant local zero structure, replacing it by the kernel's
global ZERO_PAGE.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:39 -05:00
Dulshani Gunawardhana 3c45bfc152 btrfs: Pack struct btrfs_device
Pack the structure btrfs_device in volumes.h to eliminate holes detected
by pahole, thus reducing binary memory footprint.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:26 -05:00
Rashika 95e94d14b4 btrfs: Replace multiple atomic_inc() with atomic_add()
This patch replaces multiple atomic_inc() with atomic_add() in
delayed-inode.c to reduce source code and have few instructions
for compilation.

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:19 -05:00
Rashika 2e9f595497 btrfs: Add helper function for free_root_pointers()
The function free_root_pointers() in disk-io.h contains redundant code.
Therefore, this patch adds a helper function free_root_extent_buffers()
to free_root_pointers() to eliminate redundancy.

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:13 -05:00
Liu Bo 48ec47364b Btrfs: fix a crash when running balance and defrag concurrently
Running balance and defrag concurrently can end up with a crash:

kernel BUG at fs/btrfs/relocation.c:4528!
RIP: 0010:[<ffffffffa01ac33b>]  [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs]
Call Trace:
  [<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs]
  [<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs]
  [<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs]
  [<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs]
  [<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs]
  [<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
  [<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs]
  [<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs]
  [<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200
  [<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs]
  [<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs]
  [<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs]
  [<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs]
  [<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs]
  [<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs]
  [<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs]
  [<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
  [<ffffffff81075ea0>] kthread+0xc0/0xd0
  [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
  [<ffffffff8164796c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
----------------------------------------------------------------------

It turns out to be that balance operation will bump root's @last_snapshot,
which enables snapshot-aware defrag path, and backref walking stuff will
find data reloc tree as refs' parent, and hit the BUG_ON() during COW.

As data reloc tree's data is just for relocation purpose, and will be deleted right
after relocation is done, it's unnecessary to walk those refs belonged to data reloc
tree, it'd be better to skip them.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:07 -05:00
Liu Bo 6f519564d7 Btrfs: do not run snapshot-aware defragment on error
If something wrong happens in write endio, running snapshot-aware defragment
can end up with undefined results, maybe a crash, so we should avoid it.

In order to share similar code, this also adds a helper to free the struct for
snapshot-aware defrag.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:00 -05:00
Filipe David Borba Manana 269d040ff2 Btrfs: log recovery, don't unlink inode always on error
If we get any error while doing a dir index/item lookup in the
log tree, we were always unlinking the corresponding inode in
the subvolume. It makes sense to unlink only if the lookup failed
to find the dir index/item, which corresponds to NULL or -ENOENT,
and not when other errors happen (like a transient -ENOMEM or -EIO).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:48 -05:00
Filipe David Borba Manana 488111aa0e Btrfs: fix csum search offset/length calculation in log tree
We were setting the csums search offset and length to the right values if
the extent is compressed, but later on right before doing the csums lookup
we were overriding these two parameters regardless of compression being
set or not for the extent.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:42 -05:00
Filipe David Borba Manana e46f5388cd Btrfs: fix verification of dir_item
We were ignoring the name component of the dir_item. Both the
name and data must fit within BTRFS_MAX_XATTR_SIZE(root).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:36 -05:00
Wang Shilong 9b011adfe1 Btrfs: remove scrub_super_lock holding in btrfs_sync_log()
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:13 -05:00
Wang Shilong 7fdf4b608d Btrfs: use 'u64' rather than 'int' to get extent's generation
We define a 'int' to get extent's generation by mistake,fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:06 -05:00
Miao Xie 9dced186f9 Btrfs: fix the free space write out failure when there is no data space
After running space balance on a new fs, the fs check program outputed the
following warning message:
 free space inode generation (0) did not match free space cache generation (20)

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount <dev> <mnt>
 # btrfs balance start <mnt>
 # umount <mnt>
 # btrfs check <dev>

It was because there was no data space after the space balance, and the free
space write out task didn't try to allocate a new data chunk for the free space
inode when doing the reservation. So the data space reservation failed, and in
order to tell the free space loader that this free space inode could not be
trusted, the generation of the free space inode wasn't updated. Then the check
program found this problem and outputed the above message.

But in fact, it is safe that we try to allocate a new data chunk when we find
the data space is not enough. The patch fixes the above problem by this way.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:49 -05:00
Josef Bacik 9e6a0c52b7 Btrfs: stop committing the transaction so much during relocate
I noticed with my horrible snapshot excercisor that we were taking forever to
relocate the larger the file system got.  This appeared to be because we were
committing the transaction _constantly_.  There were a few places where we do
braindead things with metadata reservation, like start a transaction and then
try to refill the block rsv, which not only keeps us from committing a
transaction during the enospc stuff, but keeps us from doing some of the harder
flushing work which will make us more likely to need to commit the transaction.
We also were checking the block rsv and committing the transaction if the block
rsv was below a certain threshold, but we were doing this in a place where we
don't actually keep anything in the block rsv so this was always ending up false
so we always committed the transaction in this case.  I tested this to make sure
it didn't break anything, but it takes about 10 hours to get the box to this
state so I don't know how much of an impact it will really make.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:31 -05:00
Josef Bacik 9f23e289ed Btrfs: make sure the delalloc workers actually flush compressed writes
When using delalloc workers in a non-waiting way (like for enospc handling) we
can end up not actually waiting for the dirty pages to be started if we have
compression.  We need to add an extra filemap flush to make sure any async
extents that have started are actually moved along before returning.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:22 -05:00
Josef Bacik 9385876917 Btrfs: take ordered root lock when removing ordered operations inode
A user reported a list corruption warning from btrfs_remove_ordered_extent, it
is because we aren't taking the ordered_root_lock when we remove the inode from
the ordered operations list.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:10 -05:00
Josef Bacik d788a34929 Btrfs: don't abort transaction in run_delalloc_nocow
This is just the write path, the only reason we start a transaction is so we can
check cross references, we don't make any actual changes, so there is no reason
to abort the transaction if we fail.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:58 -05:00
Josef Bacik 02ecd2c278 Btrfs: do not bug_on if we try to cow a free space cache inode
We can just return an error and we'll bail out properly.  We still want to catch
this case to make sure we don't have a bug somewhere, so just warn if this pops
up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:49 -05:00
Josef Bacik 0ef8b72607 Btrfs: return an error from btrfs_wait_ordered_range
I noticed that if the free space cache has an error writing out it's data it
won't actually error out, it will just carry on.  This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything.  So fix this in order to keep us from making free space cache look
valid when it really isnt.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:35 -05:00
Josef Bacik ed2590953b Btrfs: stop using vfs_read in send
Apparently we don't actually close the files until we return to userspace, so
stop using vfs_read in send.  This is actually better for us since we can avoid
all the extra logic of holding the file we're sending open and making sure to
clean it up.  This will fix people who have been hitting too many files open
errors when trying to send.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:11 -05:00
Stefan Behrens 301993a4a1 Btrfs: check_int, remove warning for mixed-mode
In mixed-mode, when a data-block was later reused for metadata, a
warning was printed. This condition is now filtered out and the
warning is eliminated in this case.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:04:25 -05:00
Stefan Behrens a5f519c91d Btrfs: fix check_int 'leaf item out of bounce' regression
Yet another cleanup patch broke code for which no xfstest exists.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:04:06 -05:00
Filipe David Borba Manana 5599488708 Btrfs: optimize extent item search in run_delayed_extent_op
Instead of doing another extent tree search if the first search failed
to find a metadata item, check if the previous item in the leaf is an
extent item and use it if it is, otherwise do the second tree search
for an extent item.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:53 -05:00
Jeff Mahoney cab45e22da btrfs: add tracing for failed reservations
When debugging ENOSPC issues, it's nice to be able to see which
reservations failed as well as the ones which succeeded.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:37 -05:00
Zach Brown 8b558c5f09 btrfs: remove fs/btrfs/compat.h
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink().  This doesn't belong in mainline.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:19 -05:00
Zach Brown 1877e1a747 btrfs: remove move_pages()
move_pages() has an inefficient backwards byte copy of regions of two
different pages.  They're different pages so the regions won't overlap
and it could use memcpy().

At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.

So remove move_pages() and just call copy_pages().

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:09 -05:00
Zach Brown 4546bcaeba btrfs: use get_seconds() instead of btrfs wrapper
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:00 -05:00
Filipe David Borba Manana 8185554d3e Btrfs: fix incorrect inode acl reset
When a directory has a default ACL and a subdirectory is created
under that directory, btrfs_init_acl() is called when the
subdirectory's inode is created to initialize the inode's ACL
(inherited from the parent directory) but it was clearing the ACL
from the inode after setting it if posix_acl_create() returned
success, instead of clearing it only if it returned an error.

To reproduce this issue:

$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt
$ mkdir /mnt/acl
$ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl
$ getfacl /mnt/acl
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::---

$ mkdir /mnt/acl/dir1
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---

After unmounting and mounting again the filesystem, fgetacl returned the
expected ACL:

$ umount /mnt/acl
$ mount /dev/loop0 /mnt
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
default:user::rwx
default:group::rwx
default:other::---

Meaning that the underlying xattr was persisted.

Reported-by: Giuseppe Fierro <giuseppe@fierro.org>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:51 -05:00
Stefan Behrens ff76b05655 Btrfs: Don't allocate inode that is already in use
Due to an off-by-one error, it is possible to reproduce a bug
when the inode cache is used.

The same inode number is assigned twice, the second time this
leads to an EEXIST in btrfs_insert_empty_items().

The issue can happen when a file is removed right after a subvolume
is created and then a new inode number is created before the
inodes in free_inode_pinned are processed.
unlink() calls btrfs_return_ino() which calls start_caching() in this
case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
searching for the highest inode (which already cannot find the
unlinked one anymore in btrfs_find_free_objectid()). So if this
unlinked inode's number is equal to the highest_ino + 1 (or >= this value
instead of > this value which was the off-by-one error), we mustn't add
the inode number to free_ino_pinned (caching_thread() does it right).
In this case we need to try directly to add the number to the inode_cache
which will fail in this case.

When this inode number is allocated while it is still in free_ino_pinned,
it is allocated and still added to the free inode cache when the
pinned inodes are processed, thus one of the following inode number
allocations will get an inode that is already in use and fail with EEXIST
in btrfs_insert_empty_items().

One example which was created with the reproducer below:
Create a snapshot, work in the newly created snapshot for the rest.
In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
start_caching() calls add_free_space [34284, 18446744073709517077].
In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
btrfs_unpin_free_ino calls add_free_space [34284, 1].
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
EEXIST when the new inode is inserted.

One possible reproducer is this one:
 #!/bin/sh
 # preparation
TEST_DEV=/dev/sdc1
TEST_MNT=/mnt
umount ${TEST_MNT} 2>/dev/null || true
mkfs.btrfs -f ${TEST_DEV}
mount ${TEST_DEV} ${TEST_MNT} -o \
 rw,relatime,compress=lzo,space_cache,inode_cache
btrfs subv create ${TEST_MNT}/s1
for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
rm ${TEST_MNT}/s2/$FILENAME
touch ${TEST_MNT}/s2/$FILENAME
 # the following steps can be repeated to reproduce the issue again and again
[ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
rm ${TEST_MNT}/s3/$FILENAME
touch ${TEST_MNT}/s3/$FILENAME
ls -alFi ${TEST_MNT}/s?/$FILENAME
touch ${TEST_MNT}/s3/_1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_1
touch ${TEST_MNT}/s3/_2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_2
touch ${TEST_MNT}/s3/__1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__1
touch ${TEST_MNT}/s3/__2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__2
 # if the above is not enough, add the following loop:
for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
 # already in use that btrfs_find_ino_for_alloc() returns.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:36 -05:00
Filipe David Borba Manana e8b0d724d5 Btrfs: fix btrfs_prev_leaf() previous key computation
If we decrement the key type, we must reset its offset to the largest
possible offset (u64)-1. If we decrement the key's objectid, then we
must reset the key's type and offset to their largest possible values,
(u8)-1 and (u64)-1 respectively. Not doing so can make us miss an
items in the tree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:26 -05:00
Filipe David Borba Manana e93ae26fe1 Btrfs: optimize tree-log.c:count_inode_refs()
Avoid repeated tree searches by processing all inode ref items in
a leaf at once instead of processing one at a time, followed by a
path release and a tree search for a key with a decremented offset.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:19 -05:00
Geyslan G. Bem 229eed4348 btrfs: simplify kmalloc+copy_from_user to memdup_user
Use memdup_user rather than duplicating its implementation
This is a little bit restricted to reduce false positives

The semantic patch that makes this report is available
in scripts/coccinelle/api/memdup_user.cocci.

More information about semantic patching is available at
http://coccinelle.lip6.fr/

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:51 -05:00
chandan 5ede859b00 Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison.
Comparison of an inode's last modified transaction with the last committed
transaction is incorrect. Fix it.

Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:37 -05:00
Filipe David Borba Manana 3c77bd94ec Btrfs: don't leak delayed node on path allocation failure
If the path allocation failed, we would return without decrementing
the reference count in the delayed node we got before, resulting
in a leak.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:27 -05:00
Stefan Behrens 361c093d7f Btrfs: Wait for uuid-tree rebuild task on remount read-only
If the user remounts the filesystem read-only while the uuid-tree
scan and rebuild task is still running (this happens once after the
filesystem was mounted with an old kernel, or when forced with the
mount options), the remount should wait on the tasks completion
before setting the filesystem read-only. Otherwise the background
task continues to write to the filesystem which is apparently not
what users expect.

The reproducer:

TEST_DEV=/dev/sdzzzzz1
TEST_MNT=/mnt
mkfs.btrfs -f $TEST_DEV
mount $TEST_DEV $TEST_MNT
for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done
umount $TEST_MNT
mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree
sleep 1
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
mount $TEST_DEV $TEST_MNT -o ro,remount
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
sleep 1
umount $TEST_MNT

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:18 -05:00
Stefan Behrens 27087f3701 Btrfs: init device stats for new devices
Device stats are only initialized (read from tree items) on mount.
Trying to read device stats after adding or replacing new devices will
return errors.

btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two
functions that allocate and initialize new btrfs_device structures after
a filesystem is mounted. They set the device stats to zero by using
kzalloc() which is correct for new devices. The only missing thing was
to declare these stats as being valid (device->dev_stats_valid = 1) and
this patch adds this missing code.

This is the reproducer:

TEST_DEV1=/dev/sdzzzzz1
TEST_DEV2=/dev/sdzzzzz2
TEST_DEV3=/dev/sdzzzzz3
TEST_MNT=/mnt
mkfs.btrfs $TEST_DEV1
mount $TEST_DEV1 $TEST_MNT
btrfs device add $TEST_DEV2 $TEST_MNT
btrfs device stat $TEST_MNT
btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT
btrfs device stat $TEST_MNT
umount $TEST_MNT

Reported-by: Ondrej Kunc <kunc88@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:09 -05:00
Liu Bo 30d133fc22 Btrfs: fixup error path in __btrfs_inc_extent_ref
When we fail to add a reference after a non-inline insertion by some reasons,
eg. ENOSPC, we'll abort the transaction, but we don't return this error to
the caller who has to walk around again to find something wrong, that's
unnecessary.

Also fixup other error paths to keep it simple.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:00 -05:00
Ilya Dryomov e649e587cb Btrfs: disallow 'btrfs {balance,replace} cancel' on ro mounts
For both balance and replace, cancelling involves changing the on-disk
state and committing a transaction, which is not a good thing to do on
read-only filesystems.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:50 -05:00
Ilya Dryomov adfa97cbdf Btrfs: don't leak ioctl args in btrfs_ioctl_dev_replace
struct btrfs_ioctl_dev_replace_args memory is leaked if replace is
requested on a read-only filesystem.  Fix it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:37 -05:00
Ilya Dryomov f747cab7b7 Btrfs: nuke a bogus rw_devices decrement in __btrfs_close_devices
On mount failures, __btrfs_close_devices can be called well before
dev-replace state is read and ->is_tgtdev_for_dev_replace is set.  This
leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in
__btrfs_close_devices if replace target device happens to be on the
lists and we fail early in the mount sequence.  Fix this by checking
the devid instead of ->is_tgtdev_for_dev_replace before the decrement:
for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:24 -05:00
Geyslan G. Bem 03b2f08b5f btrfs: Fix memory leakage in the tree-log.c
In add_inode_ref() function:

Initializes local pointers.

Reduces the logical condition with the __add_inode_ref() return
value by using only one 'goto out'.

Centralizes the exiting, ensuring the freeing of all used memory.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:11 -05:00
Liu Bo 498456d33e Btrfs: kill unused code in btrfs_search_forward
After commit de78b51a28
(btrfs: remove cache only arguments from defrag path), @blockptr is no more
used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:56 -05:00
Liu Bo 8319bfe136 Btrfs: cleanup dead code of defragment
@is_extent is no more needed since we don't defrag extent root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:45 -05:00
Filipe David Borba Manana efd0c4055a Btrfs: remove unnecessary key copy when logging inode
The btrfs_insert_empty_item() function doesn't modify its
key argument.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:30 -05:00
Chandra Seetharaman 452c75c3d2 Btrfs: Simplify the logic in alloc_extent_buffer() for existing extent buffer case
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().

Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.

While at it, this patch
  - changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
    with find_extent_buffer() to reduce redundancy.
  - removes the unused argument 'len' to find_extent_buffer().

Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:11 -05:00
Josef Bacik 7f4ca37c48 Btrfs: fix up seek_hole/seek_data handling
Whoever wrote this was braindead.  Also it doesn't work right if you have
VACANCY's since we assumed you would only have that at the end of the file,
which won't be the case in the near future.  I tested this with generic/285 and
generic/286 as well as the btrfs tests that use fssum since it uses
seek_hole/seek_data to verify things are ok.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:56 -05:00
Josef Bacik 4277a9c3b3 Btrfs: add an assert to btrfs_lookup_csums_range for alignment
I was hitting weird issues when trying to remove hole extents and it turned out
it was because I was sending non-aligned offsets down to
btrfs_lookup_csums_range.  So add an assert for this in case somebody trips over
this in the future.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:45 -05:00
Josef Bacik ed9e8af88e Btrfs: fix hole check in log_one_extent
I added an assert to make sure we were looking up aligned offsets for csums and
I tripped it when running xfstests.  This is because log_one_extent was checking
if block_start == 0 for a hole instead of EXTENT_MAP_HOLE.  This worked out fine
in practice it seems, but it adds a lot of extra work that is uneeded.  With
this fix I'm no longer tripping my assert.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:32 -05:00
Josef Bacik 0e30db86a4 Btrfs: add a sanity test for a vacant extent at the front of a file
Btrfs_get_extent was not handling this case properly, add a test to make sure we
don't regress.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:19 -05:00
Josef Bacik 25a50341b6 Btrfs: handle a missing extent for the first file extent
While trying to kill our hole extents I noticed I was seeing problems where we
seek into a file and then start writing and then try to fiemap that file later.
This is because we search for offset 0, don't find anything and so back up one
slot, which puts us at the inode ref or something like that, which means we goto
not_found and create an extent map for our entire search area.  This isn't quite
what we want, we want to move forward one slot and see if there is an extent
there so we can limit our hole extent.  This patch fixes this problem, I will
add a testcase for this as well.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:05 -05:00
Josef Bacik 96192499c2 Btrfs: stop all workers after we free block groups
Stefan was hitting a panic in the async worker stuff because we had outstanding
read bios while we were stopping the worker threads.  You could reproduce this
easily if you mount -o nospace_cache and ran generic/273.  This is because the
caching thread stuff is still going and we were stopping all the worker threads.
We need to stop the workers after this work is done, and the free block groups
code will wait for all the caching threads to stop first so we don't run into
this problem.  With this patch we no longer panic.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:57:49 -05:00
Josef Bacik aaedb55bc0 Btrfs: add tests for btrfs_get_extent
I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime.  This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:57:30 -05:00
Josef Bacik 294e30fee3 Btrfs: add tests for find_lock_delalloc_range
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way.  So this is obviously a
candidate for some testing.  This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages.  Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to.  With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:51 -05:00
Josef Bacik 857cc2fc29 Btrfs: free reserved space on error in a few places
While trying to track down a reserved space leak I noticed a few places where we
won't properly clean up reserved space if we have an error, this patch fixes
those up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:41 -05:00
Josef Bacik 0be5dc67c4 Btrfs: fixup reserved trace points
In trying to track down where we were leaking reserved space I noticed our
reserve extent tracepoints are a little off.  First we were saying that the
reserved space had been alloced in btrfs_reserve_extent, which isn't the case,
this needs to be triggered when we actually allocate the space when we run the
delayed ref.  We were also missing a few places where we should have been
tracing the btrfs_reserve_extent_free tracepoint.  With these in place I was
able to put together where we were leaking reserved space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:31 -05:00
Josef Bacik 2b1360da35 Btrfs: free up block groups after everything
If we abort a transaction we will do the tree log cleanup at unmount, but this
happens after we free up the block groups.  This makes all the leak detection
warnings go off because we think we've leaked space but in reality we just
haven't cleaned it up yet.  So instead do the block group cleanup stuff after
free'ing the fs roots so we don't get these warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:17 -05:00
Josef Bacik 681ae50917 Btrfs: cleanup reserved space when freeing tree log on error
On error we will wait and free the tree log at unmount without a transaction.
This means that the actual freeing of the blocks doesn't happen which means we
complain about space leaks on unmount.  So to fix this just skip the transaction
specific cleanup part of the tree log free'ing if we don't have a transaction
and that way we can free up our reserved space and our counters stay happy.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:11 -05:00
Josef Bacik eb58bb371a Btrfs: do not free the dirty bytes from the trans block rsv on cleanup
The transactions should be cleaning up their reservations on failure, this just
causes us to have warnings on unmount because we go negative by free'ing
reservations that have already been free'ed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:58 -05:00
Filipe David Borba Manana 80d94fb3df Btrfs: fix memory leaks on transaction commit failure
Structures of the types tree_mod_elem and qgroup_update are allocated
during transaction commit but were not being released if the call to
btrfs_run_delayed_items() returned an error.

Stack trace reported by kmemleak:

unreferenced object 0xffff880679f0b398 (size 128):
  comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
  hex dump (first 32 bytes):
    60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00  `..y............
    00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00  ........P.......
  backtrace:
    [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
    [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
    [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs]
    [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs]
    [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
    [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
    [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
    [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
    [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
    [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
    [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
    [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
    [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
    [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
    [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0
    [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30
unreferenced object 0xffff880677e0dd88 (size 32):
  comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
  hex dump (first 32 bytes):
    78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff  xu.........w....
    40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00  @..p............
  backtrace:
    [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
    [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
    [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs]
    [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs]
    [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs]
    [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs]
    [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
    [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
    [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
    [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
    [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
    [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
    [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
    [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
    [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
    [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:46 -05:00
Ilya Dryomov 539f358a30 Btrfs: fix the dev-replace suspend sequence
Replace progresses strictly from lower to higher offsets, and the
progress is tracked in chunks, by storing the physical offset of the
dev_extent which is being copied in the cursor_left field of
btrfs_dev_replace_item.  When we are done copying the chunk,
left_cursor is updated to point one byte past the dev_extent, so that
on resume we can skip the dev_extents that have already been copied.

There is a major bug (which goes all the way back to the inception of
dev-replace in 3.8) in the way left_cursor is bumped: the bump is done
unconditionally, without any regard to the scrub_chunk return value.
On suspend (and also on any kind of error) scrub_chunk returns early,
i.e. without completing the copy.  This leads to us skipping the chunk
that hasn't been fully copied yet when resuming.

Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
(On suspend scrub_chunk returns with -ECANCELED, so this fix covers
both suspend and error cases.)

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:36 -05:00
Filipe David Borba Manana 778ba82b17 Btrfs: improve inode hash function/inode lookup
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:

1) When we have many subvolumes. Each subvolume has its own btree
   where its files and directories are added to, and each has its
   own objectid (inode number) namespace. This means that if we have
   N subvolumes, and all have inode number X associated to a file or
   directory, the corresponding inodes all map to the same hash table
   entry, resulting in a bucket (hlist_head list) with N elements;

2) On 32 bits machines. Th VFS hash values are unsigned longs, which
   are 32 bits wide on 32 bits machines, and the inode (objectid)
   numbers are 64 bits unsigned integers. We simply cast the inode
   numbers to hash values, which means that for all inodes with the
   same 32 bits lower half, the same hash bucket is used for all of
   them. For example, all inodes with a number (objectid) between
   0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
   the same hash table bucket.

This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:19 -05:00
Filipe David Borba Manana 3d41d70252 Btrfs: remove unnecessary tree search when logging inode
In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward()
until it returns a key whose objectid is higher than our inode or until
the key's type is higher than our maximum allowed type.

At the end of the loop, we increment our mininum search key's objectid
and type regardless of our desired target objectid and maximum desired
type, which causes another loop iteration that will call again
btrfs_search_forward() just to figure out we've gone beyond our maximum
key and exit the loop. Therefore while incrementing our minimum key,
don't do it blindly and exit the loop immiediately if the next search
key's objectid or type is beyond what we seek.

Also after incrementing the type, set the key's offset to 0, which was
missing and could make us loose some of the inode's items.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:11 -05:00
Filipe David Borba Manana 6174d3cb43 Btrfs: remove unused max_key arg from btrfs_search_forward
It is not used for anything.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:57 -05:00