Commit Graph

32630 Commits

Author SHA1 Message Date
Jeff Layton 38d77c50b4 cifs: track the enablement of signing in the TCP_Server_Info
Currently, we determine this according to flags in the sec_mode, flags
in the global_secflags and via other methods. That makes the semantics
very hard to follow and there are corner cases where we don't handle
this correctly.

Add a new bool to the TCP_Server_Info that acts as a simple flag to tell
us whether signing is enabled on this connection or not, and fix up the
places that need to determine this to use that flag.

This is a bit weird for the SMB2 case, where signing is per-session.
SMB2 needs work in this area already though. The existing SMB2 code has
similar logic to what we're using here, so there should be no real
change in behavior. These changes should make it easier to implement
per-session signing in the future though.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton 1e3cc57e47 add new fields to smb_vol to track the requested security flavor
We have this to some degree already in secFlgs, but those get "or'ed" so
there's no way to know what the last option requested was. Add new fields
that will eventually supercede the secFlgs field in the cifs_ses.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton 28e11bd86d cifs: add new fields to cifs_ses to track requested security flavor
Currently we have the overrideSecFlg field, but it's quite cumbersome
to work with. Add some new fields that will eventually supercede it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:43 -05:00
Jeff Layton e598d1d8fb cifs: track the flavor of the NEGOTIATE reponse
Track what sort of NEGOTIATE response we get from the server, as that
will govern what sort of authentication types this socket will support.

There are three possibilities:

LANMAN: server sent legacy LANMAN-type response

UNENCAP: server sent a newer-style response, but extended security bit
wasn't set. This socket will only support unencapsulated auth types.

EXTENDED: server sent a newer-style response with the extended security
bit set. This is necessary to support krb5 and ntlmssp auth types.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00
Jeff Layton 515d82ffd0 cifs: add new "Unspecified" securityEnum value
Add a new securityEnum value to cover the case where a sec= option
was not explicitly set.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00
Jeff Layton 9193400b69 cifs: factor out check for extended security bit into separate function
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:42 -05:00
Jeff Layton 9ddec56131 cifs: move handling of signed connections into separate function
Move the sanity checks for signed connections into a separate function.
SMB2's was a cut-and-paste job from CIFS code, so we can make them use
the same function.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-06-24 01:56:41 -05:00
Jeff Layton 2190eca1d0 cifs: break out lanman NEGOTIATE handling into separate function
...this also gets rid of some #ifdef ugliness too.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:41 -05:00
Jeff Layton 31d9e2bd5f cifs: break out decoding of security blob into separate function
...cleanup.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:41 -05:00
Jeff Layton 281e2e7d06 cifs: remove the cifs_ses->flags field
This field is completely unused:

CIFS_SES_W9X is completely unused. CIFS_SES_LANMAN and CIFS_SES_OS2
are set but never checked. CIFS_SES_NT4 is checked, but never set.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:40 -05:00
Jeff Layton 3534b8508e cifs: throw a warning if negotiate or sess_setup ops are passed NULL server or session pointers
These look pretty cargo-culty to me, but let's be certain. Leave
them in place for now. Pop a WARN if it ever does happen. Also,
move to a more standard idiom for setting the "server" pointer.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:40 -05:00
Jeff Layton 7d06645969 cifs: make decode_ascii_ssetup void return
...rc is always set to 0.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:39 -05:00
Jeff Layton ffa598a537 cifs: remove useless memset in LANMAN auth code
It turns out that CIFS_SESS_KEY_SIZE == CIFS_ENCPWD_SIZE, so this
memset doesn't do anything useful.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:39 -05:00
Jeff Layton 6f709494a7 cifs: remove protocolEnum definition
The field that held this was removed quite some time ago.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:39 -05:00
Jeff Layton a0b3df5cf1 cifs: add a "nosharesock" mount option to force new sockets to server to be created
Some servers set max_vcs to 1 and actually do enforce that limit. Add a
new mount option to work around this behavior that forces a mount
request to open a new socket to the server instead of reusing an
existing one.

I'd prefer to come up with a solution that doesn't require this, so
consider this a debug patch that you can use to determine whether this
is the real problem.

Cc: Jim McDonough <jmcd@samba.org>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-06-24 01:56:38 -05:00
Randy Dunlap acdb37c361 fs: fix new splice.c kernel-doc warning
Fix new kernel-doc warning in fs/splice.c:

  Warning(fs/splice.c:1298): No description found for parameter 'opos'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-23 16:19:56 -10:00
Eric Sandeen 427d9fe233 xfs: check on-disk (not incore) btree root size in dfrag.c
xfs_swap_extents_check_format() contains checks to make sure that
original and the temporary files during defrag are compatible;
Gabriel VLASIU ran into a case where xfs_fsr returned EINVAL
because the tests found the btree root to be of size 120,
while the fork offset was only 104; IOW, they overlapped.

However, this is just due to an error in the
xfs_swap_extents_check_format() tests, because it is checking
the in-memory btree root size against the on-disk fork offset.
We should be checking the on-disk sizes in both cases.

This patch adds a new macro to calculate this size, and uses
it in the tests.

With this change, the filesystem image provided by Gabriel
allows for proper file degragmentation.

Reported-by: Gabriel VLASIU <gabriel@vlasiu.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-20 13:26:09 -05:00
Al Viro 7995bd2871 splice: don't pass the address of ->f_pos to methods
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-20 19:02:45 +04:00
Andy Adamson 62f288a02f NFSv4.1 end back channel session draining
We need to ensure that we clear NFS4_SLOT_TBL_DRAINING on the back
channel when we're done recovering the session.

Regression introduced by commit 774d5f14e (NFSv4.1 Fix a pNFS session
draining deadlock)

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: Changed order to start back-channel first. Minor code cleanup]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.10]
2013-06-20 10:19:21 -04:00
Aruna Balakrishnaiah a5e4797b0f powerpc/pseries: Read common partition via pstore
This patch exploits pstore subsystem to read details of common partition
in NVRAM to a separate file in /dev/pstore. For instance, common partition
details will be stored in a file named [common-nvram-6].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 17:05:02 +10:00
Aruna Balakrishnaiah f33f748c96 powerpc/pseries: Read of-config partition via pstore
This patch set exploits the pstore subsystem to read details of
of-config partition in NVRAM to a separate file in /dev/pstore.
For instance, of-config partition details will be stored in a
file named [of-nvram-5].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 17:04:59 +10:00
Aruna Balakrishnaiah 69020eea97 powerpc/pseries: Read rtas partition via pstore
This patch set exploits the pstore subsystem to read details of rtas partition
in NVRAM to a separate file in /dev/pstore. For instance, rtas details will be
stored in a file named [rtas-nvram-4].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 17:04:52 +10:00
Abhijith Das 6a98c333ed GFS2: Fix fstrim boundary conditions
This patch correctly distinguishes two boundary conditions:

1. When the given range is entire within the unaccounted space between
   two rgrps, and
2. The range begins beyond the end of the filesystem

Also fix the unit of the returned value r.len (total trimming) to be in bytes 
instead of the (incorrect) 512 byte blocks

With this patch, GFS2 passes multiple iterations of all the relevant xfstests
(251, 260, 288) with different fs block sizes.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 21:41:26 +01:00
Benjamin Marzinski 2b12eea656 GFS2: fix warning message
This patch fixes a warning message introduced in the recent
"GFS2: aggressively issue revokes in gfs2_log_flush" patch.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 21:29:19 +01:00
Jie Liu 39a45d8463 xfs: Remove XFS_MOUNT_RETERR
XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
we always be emitting a warning and returning an error.

Hence, we can remove it and get rid of a couple of redundant
check up against it at xfs_upate_alignment().

Thanks Dave Chinner for the suggestions of simplify the code
in xfs_parseargs().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19 14:54:17 -05:00
Jie Liu 2fb8b5027d xfs: Remove two dead transaction log reservaion macros
Upstream commit 5b292ae3a9
	xfs: make use of xfs_calc_buf_res() in xfs_trans.c

Beginning from above commit, neither XFS_ALLOCFREE_LOG_RES() nor
XFS_DIROP_LOG_RES() is used by those routines for calculating
transaction space reservations, so it's safe to remove them now.

Also, with a slightly update for the relevant comments to reflect
the ideas of why those log count numbers should be.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19 14:26:16 -05:00
Jie Liu 635c4d0bd9 xfs: return FIEMAP_EXTENT_UNKNOWN for delayed allocation extent
For FIEMAP ioctl(2), if an extent is in delayed allocation
state, we need to return the FIEMAP_EXTENT_UNKNOWN flag except
the FIEMAP_EXTENT_DELALLOC because its data location is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19 14:18:32 -05:00
Mark Tinguely 725eb1eb2a xfs: fix the symbolic link assert in xfs_ifree
Adding an extended attribute to a symbolic link can force that
link to an remote extent. xfs_inactive() incorrectly assumes
that any symbolic link small enough to be in the inode core
is incore, resulting in the remote extent to not be removed.
xfs_ifree() will assert on presence of this leaked remote extent.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19 14:14:43 -05:00
Bryan Schumaker 7017310ad7 NFS: Apply v4.1 capabilities to v4.2
This fixes POSIX locks and possibly a few other v4.2 features, like
readdir plus.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-19 13:55:43 -04:00
Wei Yongjun 06452eb053 dlm: remove duplicated include from lowcomms.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-19 09:52:09 -05:00
David Howells dcfae32f89 FS-Cache: Don't use spin_is_locked() in assertions
Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
code would normally be in a locked section where it should return 1.  This
means it cannot be used for an assertion that checks that a spinlock is locked.

Remove such usages from FS-Cache.

The following oops might otherwise be observed:

FS-Cache: Assertion failed
BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
Workqueue: fscache_operation fscache_op_work_func [fscache]
7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
Call Trace:
7f091c88: [<60299eb8>] dump_stack+0x17/0x19
7f091c98: [<602951c5>] panic+0xf4/0x1e9
7f091d38: [<6002b10e>] set_signals+0x1e/0x40
7f091d58: [<6005b89e>] __wake_up+0x4e/0x70
7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache]
7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache]
7f091db8: [<60082985>] unlock_page+0x55/0x60
7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles]
7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache]
7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0
7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0
7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0
7f091f58: [<60054418>] kthread+0xd8/0xe0
7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0
7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0

Reported-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
2013-06-19 14:16:47 +01:00
David Howells 1bb4b7f98f FS-Cache: The retrieval remaining-pages counter needs to be atomic_t
struct fscache_retrieval contains a count of the number of pages that still
need some processing (n_pages).  This is decremented as the pages are
processed.

However, this needs to be atomic as fscache_retrieval_complete() (I think) just
occasionally may be called from cachefiles_read_backing_file() and
cachefiles_read_copier() simultaneously.

This happens when an fscache_read_or_alloc_pages() request containing a lot of
pages (say a couple of hundred) is being processed.  The read on each backing
page is dispatched individually because we need to insert a monitor into the
waitqueue to catch when the read completes.  However, under low-memory
conditions, we might be forced to wait in the allocator - and this gives the
I/O on the backing page a chance to complete first.

When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
the workqueue without waiting for the operation to finish the initial I/O
dispatch (we want to release any pages we can as soon as we can), thus both can
end up running simultaneously and potentially attempting to partially complete
the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
the page cache).

This was demonstrated by parallelling the non-atomic counter with an atomic
counter and printing both of them when the assertion fails.  At this point, the
atomic counter has reached zero, but the non-atomic counter has not.

To fix this, make the counter an atomic_t.

This results in the following bug appearing

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:421!

or

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:414!

With a backtrace like the following:

RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache]
Call Trace:
 [<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache]
 [<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache]
 [<ffffffff81090b10>] worker_thread+0x170/0x2a0
 [<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810909a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81096966>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968d0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Haicheng Li 2144bc78d4 cachefiles: remove unused macro list_to_page()
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 1362729b16 FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
Simplify the way fscache cache objects retain their cookie.  The way I
implemented the cookie storage handling made synchronisation a pain (ie. the
object state machine can't rely on the cookie actually still being there).

Instead of the the object being detached from the cookie and the cookie being
freed in __fscache_relinquish_cookie(), we defer both operations:

 (*) The detachment of the object from the list in the cookie now takes place
     in fscache_drop_object() and is thus governed by the object state machine
     (fscache_detach_from_cookie() has been removed).

 (*) The release of the cookie is now in fscache_object_destroy() - which is
     called by the cache backend just before it frees the object.

This means that the fscache_cookie struct is now available to the cache all the
way through from ->alloc_object() to ->drop_object() and ->put_object() -
meaning that it's no longer necessary to take object->lock to guarantee access.

However, __fscache_relinquish_cookie() doesn't wait for the object to go all
the way through to destruction before letting the netfs proceed.  That would
massively slow down the netfs.  Since __fscache_relinquish_cookie() leaves the
cookie around, in must therefore break all attachments to the netfs - which
includes ->def, ->netfs_data and any outstanding page read/writes.

To handle this, struct fscache_cookie now has an n_active counter:

 (1) This starts off initialised to 1.

 (2) Any time the cache needs to get at the netfs data, it calls
     fscache_use_cookie() to increment it - if it is not zero.  If it was zero,
     then access is not permitted.

 (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
     to decrement it.  This does a wake-up on it if it reaches 0.

 (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
     reach 0.  The initialisation to 1 in step (1) ensures that we only get
     wake ups when we're trying to get rid of the cookie.

This leaves __fscache_relinquish_cookie() a lot simpler.


***
This fixes a problem in the current code whereby if fscache_invalidate() is
followed sufficiently quickly by fscache_relinquish_cookie() then it is
possible for __fscache_relinquish_cookie() to have detached the cookie from the
object and cleared the pointer before a thread is dispatched to process the
invalidation state in the object state machine.

Since the pending write clearance was deferred to the invalidation state to
make it asynchronous, we need to either wait in relinquishment for the stores
tree to be cleared in the invalidation state or we need to handle the clearance
in relinquishment.

Further, if the relinquishment code does clear the tree, then the invalidation
state need to make the clearance contingent on still having the cookie to hand
(since that's where the tree is rooted) and we have to prevent the cookie from
disappearing for the duration.

This can lead to an oops like the following:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
...
RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30
...
CR2: 000000000000000c ...
...
Process kslowd002 (...)
....
Call Trace:
 [<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150
 [<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180
 [<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
 [<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff8110e233>] slow_work_execute+0x233/0x310
 [<ffffffff8110e515>] slow_work_thread+0x205/0x360
 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110e310>] ? slow_work_thread+0x0/0x360
 [<ffffffff81096936>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells caaef6900b FS-Cache: Fix object state machine to have separate work and wait states
Fix object state machine to have separate work and wait states as that makes
it easier to envision.

There are now three kinds of state:

 (1) Work state.  This is an execution state.  No event processing is performed
     by a work state.  The function attached to a work state returns a pointer
     indicating the next state to which the OSM should transition.  Returning
     NO_TRANSIT repeats the current state, but goes back to the scheduler
     first.

 (2) Wait state.  This is an event processing state.  No execution is
     performed by a wait state.  Wait states are just tables of "if event X
     occurs, clear it and transition to state Y".  The dispatcher returns to
     the scheduler if none of the events in which the wait state has an
     interest are currently pending.

 (3) Out-of-band state.  This is a special work state.  Transitions to normal
     states can be overridden when an unexpected event occurs (eg. I/O error).
     Instead the dispatcher disables and clears the OOB event and transits to
     the specified work state.  This then acts as an ordinary work state,
     though object->state points to the overridden destination.  Returning
     NO_TRANSIT resumes the overridden transition.

In addition, the states have names in their definitions, so there's no need for
tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
that is automatic for work states.

Since the states are now separate structs rather than values in an enum, it's
not possible to use comparisons other than (non-)equality between them, so use
some object->flags to indicate what phase an object is in.

The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
(EV_KILL).  An object flag now carries the information about retirement.

Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
into an KILL_OBJECT state and additional states have been added for handling
waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

A state has also been added for synchronising with parent object initialisation
(WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 493f7bc114 FS-Cache: Wrap checks on object state
Wrap checks on object state (mostly outside of fs/fscache/object.c) with
inline functions so that the mechanism can be replaced.

Some of the state checks within object.c are left as-is as they will be
replaced.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 610be24ee4 FS-Cache: Uninline fscache_object_init()
Uninline fscache_object_init() so as not to expose some of the FS-Cache
internals to the cache backend.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 0c59a95d90 FS-Cache: Don't sleep in page release if __GFP_FS is not set
Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set.  This
goes some way towards mitigating fscache deadlocking against ext4 by way of
the allocator, eg:

INFO: task flush-8:0:24427 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0       D ffff88003e2b9fd8     0 24427      2 0x00000000
 ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
 0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
 0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
Call Trace:
 [<ffffffff8106dcf5>] ? __lock_is_held+0x31/0x53
 [<ffffffff81219b61>] ? radix_tree_lookup_element+0xf4/0x12a
 [<ffffffff81454bed>] schedule+0x60/0x62
 [<ffffffffa01d349c>] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
 [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
 [<ffffffffa01d393a>] __fscache_maybe_release_page+0x30c/0x324 [fscache]
 [<ffffffffa01d369a>] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffffa01fd7b2>] nfs_fscache_release_page+0x68/0x94 [nfs]
 [<ffffffffa01ef73e>] nfs_release_page+0x7e/0x86 [nfs]
 [<ffffffff810aa553>] try_to_release_page+0x32/0x3b
 [<ffffffff810b6c70>] shrink_page_list+0x535/0x71a
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff810b7352>] shrink_inactive_list+0x20a/0x2dd
 [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
 [<ffffffff810b7a65>] shrink_lruvec+0x34c/0x3eb
 [<ffffffff810b7bd3>] do_try_to_free_pages+0xcf/0x355
 [<ffffffff810b7fc8>] try_to_free_pages+0x9a/0xa1
 [<ffffffff810b08d2>] __alloc_pages_nodemask+0x494/0x6f7
 [<ffffffff810d9a07>] kmem_getpages+0x58/0x155
 [<ffffffff810dc002>] fallback_alloc+0x120/0x1f3
 [<ffffffff8106db23>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810dbed3>] ____cache_alloc_node+0x177/0x186
 [<ffffffff81162a6c>] ? ext4_init_io_end+0x1c/0x37
 [<ffffffff810dc403>] kmem_cache_alloc+0xf1/0x176
 [<ffffffff810b17ac>] ? test_set_page_writeback+0x101/0x113
 [<ffffffff81162a6c>] ext4_init_io_end+0x1c/0x37
 [<ffffffff81162ce4>] ext4_bio_write_page+0x20f/0x3af
 [<ffffffff8115cc02>] mpage_da_submit_io+0x26e/0x2f6
 [<ffffffff811088e5>] ? __find_get_block_slow+0x38/0x133
 [<ffffffff81161348>] mpage_da_map_and_submit+0x3a7/0x3bd
 [<ffffffff81161a60>] ext4_da_writepages+0x30d/0x426
 [<ffffffff810b3359>] do_writepages+0x1c/0x2a
 [<ffffffff81102f4d>] __writeback_single_inode+0x3e/0xe5
 [<ffffffff81103995>] writeback_sb_inodes+0x1bd/0x2f4
 [<ffffffff81103b3b>] __writeback_inodes_wb+0x6f/0xb4
 [<ffffffff81103c81>] wb_writeback+0x101/0x195
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff811043aa>] ? wb_do_writeback+0xaa/0x173
 [<ffffffff8110434a>] wb_do_writeback+0x4a/0x173
 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81038554>] ? del_timer+0x4b/0x5b
 [<ffffffff811044e0>] bdi_writeback_thread+0x6d/0x147
 [<ffffffff81104473>] ? wb_do_writeback+0x173/0x173
 [<ffffffff81048fbc>] kthread+0xd0/0xd8
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
2 locks held by flush-8:0/24427:
 #0:  (&type->s_umount_key#41){.+.+..}, at: [<ffffffff810e3b73>] grab_super_passive+0x4c/0x76
 #1:  (jbd2_handle){+.+...}, at: [<ffffffff81190d81>] start_this_handle+0x475/0x4ea


The problem here is that another thread, which is attempting to write the
to-be-stored NFS page to the on-ext4 cache file is waiting for the journal
lock, eg:

INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:2     D ffff880039589768     0 24437      2 0x00000000
 ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
 0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
 0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
Call Trace:
 [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
 [<ffffffff81455e73>] ? _raw_spin_unlock_irqrestore+0x3a/0x50
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81454bed>] schedule+0x60/0x62
 [<ffffffff81190c23>] start_this_handle+0x317/0x4ea
 [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
 [<ffffffff81190fcc>] jbd2__journal_start+0xb3/0x12e
 [<ffffffff81176606>] __ext4_journal_start_sb+0xb2/0xc6
 [<ffffffff8115f137>] ext4_da_write_begin+0x109/0x233
 [<ffffffff810a964d>] generic_file_buffered_write+0x11a/0x264
 [<ffffffff811032cf>] ? __mark_inode_dirty+0x2d/0x1ee
 [<ffffffff810ab1ab>] __generic_file_aio_write+0x2a5/0x2d5
 [<ffffffff810ab24a>] generic_file_aio_write+0x6f/0xd0
 [<ffffffff81159a2c>] ext4_file_write+0x38c/0x3c4
 [<ffffffff810e0915>] do_sync_write+0x91/0xd1
 [<ffffffffa00a17f0>] cachefiles_write_page+0x26f/0x310 [cachefiles]
 [<ffffffffa01d470b>] fscache_write_op+0x21e/0x37a [fscache]
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffffa01d2479>] fscache_op_work_func+0x78/0xd7 [fscache]
 [<ffffffff8104455a>] process_one_work+0x232/0x3a8
 [<ffffffff810444ff>] ? process_one_work+0x1d7/0x3a8
 [<ffffffff81044ee0>] worker_thread+0x214/0x303
 [<ffffffff81044ccc>] ? manage_workers+0x245/0x245
 [<ffffffff81048fbc>] kthread+0xd0/0xd8
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
4 locks held by kworker/u:2/24437:
 #0:  (fscache_operation){.+.+.+}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
 #1:  ((&op->work)){+.+.+.}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
 #2:  (sb_writers#14){.+.+.+}, at: [<ffffffff810ab22c>] generic_file_aio_write+0x51/0xd0
 #3:  (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [<ffffffff810ab236>] generic_file_aio_write+0x5b/0x

fscache already tries to cancel pending stores, but it can't cancel a write
for which I/O is already in progress.

An alternative would be to accept writing garbage to the cache under extreme
circumstances and to kill the afflicted cache object if we have to do this.
However, we really need to know how strapped the allocator is before deciding
to do that.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
J. Bruce Fields 6bd5e82b09 CacheFiles: name i_mutex lock class explicitly
Just some cleanup.

(And note the caller of this function may, for example, call vfs_unlink
on a child, so the "1" (I_MUTEX_PARENT) really was what was intended
here.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Sebastian Andrzej Siewior ee8be57bc3 fs/fscache: remove spin_lock() from the condition in while()
The spinlock() within the condition in while() will cause a compile error
if it is not a function. This is not a problem on mainline but it does not
look pretty and there is no reason to do it that way.
That patch writes it a little differently and avoids the double condition.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Benjamin Marzinski 5d054964f5 GFS2: aggressively issue revokes in gfs2_log_flush
This patch looks at all the outstanding blocks in all the transactions
on the log, and moves the completed ones to the ail2 list.  Then it
issues revokes for these blocks.  This will hopefully speed things up
in situations where there is a lot of contention for glocks, especially
if they are acquired serially.

revoke_lo_before_commit will issue at most one log block's full of these
preemptive revokes. The amount of reserved log space that
gfs2_log_reserve() ignores has been incremented to allow for this extra
block.

This patch also consolidates the common revoke instructions into one
function, gfs2_add_revoke().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 09:41:59 +01:00
Trond Myklebust 7dc0ac70f8 NFSv4.1: Clean up layout segment comparison helper names
Give them names that are a bit more consistent with the general
pNFS naming scheme.

 - lo_seg_contained -> pnfs_lseg_range_contained
 - lo_seg_intersecting -> pnfs_lseg_range_intersecting
 - cmp_layout -> pnfs_lseg_range_cmp
 - is_matching_lseg -> pnfs_lseg_range_match

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-18 13:47:18 -04:00
Trond Myklebust 3cb2df17ae NFSv4.1: layout segment comparison helpers should take 'const' parameters
Also strip off the unnecessary 'inline' declarations.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-18 13:47:18 -04:00
Trond Myklebust c8d74d9b68 NFSv4: Move the DNS resolver into the NFSv4 module
The other protocols don't use it, so make it local to NFSv4, and
remove the EXPORT.
Also ensure that we only compile in cache_lib.o if we're using
the legacy DNS resolver.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
2013-06-18 13:47:18 -04:00
Djalal Harouni fe2d5395c4 NFSv4: SETCLIENTID add the format string for the NETID
Make sure that NFSv4 SETCLIENTID does not parse the NETID as a
format string.

Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-18 13:45:01 -04:00
Greg Kroah-Hartman bb07b00be7 Merge 3.10-rc6 into driver-core-next
We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-17 16:57:20 -07:00
Maxim Patlasov 14c14414d1 fuse: hold i_mutex in fuse_file_fallocate()
Changing size of a file on server and local update (fuse_write_update_size)
should be always protected by inode->i_mutex. Otherwise a race like this is
possible:

1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE).
fuse_file_fallocate() sends FUSE_FALLOCATE request to the server.
2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr()
sends shrinking FUSE_SETATTR request to the server and updates local i_size
by i_size_write(inode, outarg.attr.size).
3. Process 'A' resumes execution of fuse_file_fallocate() and calls
fuse_write_update_size(inode, offset + length). But 'offset + length' was
obsoleted by ftruncate from previous step.

Changed in v2 (thanks Brian and Anand for suggestions):
 - made relation between mutex_lock() and fuse_set_nowrite(inode) more
   explicit and clear.
 - updated patch description to use ftruncate(2) in example

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-06-18 01:39:03 +02:00
Jeff Liu 1ebdf3611c xfs: Remove struct xfs_chash from xfs_mount
Remove struct xfs_chash from struct xfs_mount as there is no user of
it nowadays.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-17 17:54:21 -05:00
Jie Liu 34d7f603b9 xfs: Don't keep silent if sunit/swidth can not be changed via mount
As per the mount man page, sunit and swidth can be changed via
mount options.  For XFS, on the face of it, those options seems
works if the specified alignments is properly, e.g.
# mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
# mount | grep sdb1
/dev/sdb1 on /mnt type xfs (rw,sunit=4096,swidth=8192)

However, neither sunit nor swidth is shown from the xfs_info output.
# xfs_info /mnt
meta-data=/dev/sdb1    isize=256    agcount=4, agsize=262144 blks
         =             sectsz=512   attr=2
data     =             bsize=4096   blocks=1048576, imaxpct=25
         =             sunit=0      swidth=0 blks
		       ^^^^^^^^^^^^^^^^^^^^^^^^^^
naming   =version 2    bsize=4096   ascii-ci=0
log      =internal     bsize=4096   blocks=2560, version=2
         =             sectsz=512   sunit=0 blks, lazy-count=1
realtime =none         extsz=4096   blocks=0, rtextents=0

The reason is that the alignment can only be changed if the relevant
super block is already configured with alignments, otherwise, the
given value is silently ignored.

With this fix, the attempt to mount a storage without strip alignment
setup on a super block will get an error with a warning in syslog to
indicate the true cause, e.g.
# mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
	dmesg | tail  or so
.......
XFS (sdb1): cannot change alignment: superblock does not support data
alignment

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-17 17:49:02 -05:00
Jie Liu 897366f0e4 xfs: Remove redundant error variable from xfs_growfs_data_private()
Commit eab4e633 "xfs: uncached buffer reads need to return an error".

Remove redundant error variable, using the function level error variable
to store bp->b_error instead.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-17 17:43:04 -05:00
Joe Perches b2410e92b7 xfs: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-17 17:42:25 -05:00
Jon Ernst 03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Namjae Jeon 696c018c77 f2fs: add remount_fs callback support
Add the f2fs_remount function call which will be used
during the filesystem remounting. This function
will help us to change the mount options specific to
f2fs.

Also modify the f2fs background_gc mount option, which
will allow the user to dynamically trun on/off the
garbage collection in f2fs based on the background_gc
value. If background_gc=on, Garbage collection will
be turned off & if background_gc=off, Garbage collection
will be truned on.

By default the garbage collection is on in f2fs.

Change Log:
v2: Incorporated the review comments by Gu Zheng.
    Removing the restore part for VFS flags
    Updating comments with proper flag conditions
    Display GC background option as ON/OFF
    Revised conditions to stop GC in case of remount

v1: Initial changes for adding remount_fs callback
support.

Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change /** with /* for the coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-17 21:17:55 +09:00
Linus Torvalds d0ff934881 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Several fixes + obvious cleanup (you've missed a couple of open-coded
  can_lookup() back then)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  snd_pcm_link(): fix a leak...
  use can_lookup() instead of direct checks of ->i_op->lookup
  move exit_task_namespaces() outside of exit_notify()
  fput: task_work_add() can fail if the caller has passed exit_task_work()
  ncpfs: fix rmdir returns Device or resource busy
2013-06-14 19:18:56 -10:00
Linus Torvalds d58c6ff0b7 xfs: fixes for 3.10-rc6
- Remove noisy warnings about experimental support which spams the logs
 - Add padding to align directory and attr structures correctly
 - Set block number on child buffer on a root btree split
 - Disable verifiers during log recovery for non-CRC filesystems
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRu4gPAAoJENaLyazVq6ZO0GwP/j7i8hEl6hoFZZJ2WX7niFCP
 t0r218J9JZDCLSk7+rY26gmxOzifRHAIt5TRwwqSCbNnZbuQZsqFUpvDMSMY3XOj
 4qnUlO6diRLonN5ixrOb5YMTQJ8YHG7cB4jvxBDAqPqEfNpRyqikxstcH6KBmtSU
 duqhuQMdmHAjMUqfpdt5ewueOCmw6jI79ZqvMnEfSHW7YS7G4SrKYa71HkfRR6CD
 +K/FqEoDO/9psbsFlrkQ4Uvqngp8c9c0wQULxreN0BSdRbVqHfrS6eAWGhT3K2HW
 7ZGxEiTcwR5XCtDQjhw7vbZQEMeMcl6yZ6J7e+jJc53maySOOrqCaYyyrhzZFw4H
 Xh52pcVJtGuGVBHDxpfhI5e7KI4DjEugQK9AaONy02bhhTh3r3CKu5pprDyenyHr
 9s/DG8u/gJX8tm8DSBlIXv2iCvY4mTeesYkMaLHgC8uLXmItkRBoUaj1NQvnsTqo
 EF1xVVqh3aiueD4+cvu3+x4J4dTFmYQ++Oi3Zt1YpjBBb/h3n3KFUfizhRIp9r43
 R4UO5W3b6s4q/1oC+bO6Qlxfny9vcyz+UrkcLpbuo+cRTC3bKi85v2Gaaw69bcB1
 1SZCFRuVvDvzffX6Nir699Dj/uU4GETvDw/+y/igcKcETx6L4AgQPV9y/izJq5zr
 zLhC+OSCDvuOGaOmRvco
 =bijX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Ben Myers:
 - Remove noisy warnings about experimental support which spams the logs
 - Add padding to align directory and attr structures correctly
 - Set block number on child buffer on a root btree split
 - Disable verifiers during log recovery for non-CRC filesystems

* tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs:
  xfs: don't shutdown log recovery on validation errors
  xfs: ensure btree root split sets blkno correctly
  xfs: fix implicit padding in directory and attr CRC formats
  xfs: don't emit v5 superblock warnings on write
2013-06-14 19:16:31 -10:00
Al Viro 0525290119 use can_lookup() instead of direct checks of ->i_op->lookup
a couple of places got missed back when Linus has introduced that one...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:41:45 +04:00
Oleg Nesterov e7b2c40692 fput: task_work_add() can fail if the caller has passed exit_task_work()
fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.

Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from

	if (PF_KTHREAD) {
		schedule_work(...);
		return;
	}
	task_work_add(...)

to
	if (!PF_KTHREAD) {
		if (!task_work_add(...))
			return;
		/* fallback */
	}
	schedule_work(...);

As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:39:08 +04:00
Rob Herring 2644665458 pstore/ram: remove the power of buffer size limitation
There doesn't appear to be any reason for the overall pstore RAM buffer to
be a power of 2 size, so remove it. The individual console, ftrace and oops
buffers are still a power of 2 size.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-14 15:54:41 -07:00
Rob Herring 0405a5cec3 pstore/ram: avoid atomic accesses for ioremapped regions
For persistent RAM outside of main memory, the memory may have limitations
on supported accesses. For internal RAM on highbank platform exclusive
accesses are not supported and will hang the system. So atomic_cmpxchg
cannot be used. This commit uses spinlock protection for buffer size and
start updates on ioremapped regions instead.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-06-14 15:54:21 -07:00
Dave Chinner d302cf1d31 xfs: don't shutdown log recovery on validation errors
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.

Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.

For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 9222a9cf86)
2013-06-14 15:59:45 -05:00
Dave Chinner 088c9f67c3 xfs: ensure btree root split sets blkno correctly
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit ade1335afe)
2013-06-14 15:59:31 -05:00
Dave Chinner 5170711df7 xfs: fix implicit padding in directory and attr CRC formats
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.

It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:

[  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[  172.316346] #9 calling xfs_dir2_data_log_unused()
[  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.

Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.

And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.

Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 8a1fd2950e)
2013-06-14 15:59:16 -05:00
Dave Chinner 47ad2fcba9 xfs: don't emit v5 superblock warnings on write
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:

XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!

And spamming the logs.

We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 34510185ab)
2013-06-14 15:58:47 -05:00
Dave Chinner 9222a9cf86 xfs: don't shutdown log recovery on validation errors
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.

Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.

For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-14 15:29:31 -05:00
Mike Christie 86e92ad299 dlm: disable nagle for SCTP
For TCP we disable Nagle and I cannot think of why it would be needed
for SCTP. When disabled it seems to improve dlm_lock operations like it
does for TCP.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie 5d6898714f dlm: retry failed SCTP sends
Currently if a SCTP send fails, we lose the data we were trying
to send because the writequeue_entry is released when we do the send.
When this happens other nodes will then hang waiting for a reply.

This adds support for SCTP to retry the send operation.

I also removed the retry limit for SCTP use, because we want
to make sure we try every path during init time and for longer
failures we want to continually retry in case paths come back up
while trying other paths. We will do this until userspace tells us
to stop.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie 98e1b60ecc dlm: try other IPs when sctp init assoc fails
Currently, if we cannot create a association to the first IP addr
that is added to DLM, the SCTP init assoc code will just retry
the same IP. This patch adds a simple failover schemes where we
will try one of the addresses that was passed into DLM.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie b390ca38d2 dlm: clear correct bit during sctp init failure handling
We should be testing and cleaing the init pending bit because later
when sctp_init_assoc is recalled it will be checking that it is not set
and set the bit.

We do not want to touch CF_CONNECT_PENDING here because we will queue
swork and process_send_sockets will then call the connect_action function.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie e1631d0c48 dlm: set sctp assoc id during setup
sctp_assoc was not getting set so later lookups failed.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:10 -05:00
Mike Christie efad7e6b1a dlm: clear correct init bit during sctp setup
We were clearing the base con's init pending flags, but the
con for the node was the one with the pending bit set.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:10 -05:00
Josef Bacik 8c2a1a3028 Btrfs: exclude logged extents before replying when we are mixed
With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them.  However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log.  To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents.  With this patch we now pass xfstests generic/311 with mixed
block groups turned on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:17 -04:00
Josef Bacik 01cd33674e Btrfs: put our inode if orphan cleanup fails
When we cross into a different subvol when doing a lookup we will run the orhpan
cleanup.  If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:16 -04:00
Josef Bacik c69b26b011 Btrfs: add some missing iput()'s in btrfs_orphan_cleanup
There are some error cases that we don't do an iput() on our inode, fix this.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:15 -04:00
Josef Bacik e78417d192 Btrfs: do not pin while under spin lock
When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted.  This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock.  Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:13 -04:00
Thomas Meyer a5959bc0a1 Btrfs: Cocci spatch "memdup.spatch"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:12 -04:00
Thomas Meyer 97a184fe81 Btrfs: Cocci spatch "ptr_ret.spatch"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:11 -04:00
Jan Schmidt b382a324b6 Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.

First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.

Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
	(A) rescan ioctl
	(B) rescan resume by mount
	(C) rescan by quota enable

Each case needs its own combination of the four following steps:
	(1) set the progress [A, C: zero; B: state of umount]
	(2) commit the transaction [A]
	(3) set the counters [A, C: zero; B: state of umount]
	(4) start worker [A, B, C]

qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.

We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.

As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:10 -04:00
Jan Schmidt eb1716af88 Btrfs: avoid double free of fs_info->qgroup_ulist
When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've
already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config
called from quota_disable makes another ulist_free(fs_info->qgroup_ulist)
call.

We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning
the ulist_free in btrfs_free_qgroup_config into a noop.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:08 -04:00
Jan Schmidt 4373519db4 Btrfs: fix memory patcher through fs_info->qgroup_ulist
Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during
btrfs_read_qgroup_config and meant to be used later by the qgroup accounting
code. However, it is always freed before btrfs_read_qgroup_config returns,
becuase the commit mentioned above adds a check for (ret), where a check
for (ret < 0) would have been the right choice. This commit fixes the check.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:07 -04:00
Josef Bacik d52be818e6 Btrfs: simplify unlink reservations
Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files.  The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try.  So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation.  If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve.  With this patch Dave's reproducer now works and I can rm all the files
on the file system.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:06 -04:00
Miao Xie c6adc9cc08 Btrfs: merge pending IO for tree log write back
Before applying this patch, we flushed the log tree of the fs/file
tree firstly, and then flushed the log root tree. It is ineffective,
especially on the hard disk. This patch improved this problem by wrapping
the above two flushes by the same blk_plug.

By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
on my scsi disk whose disk buffer was enabled.

Test step:
 # mkfs.btrfs -f -m single <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:05 -04:00
Liu Bo a96fbc7288 Btrfs: allow file data clone within a file
We did not allow file data clone within the same file because of
deadlock issues.

However, we now use nested lock to avoid deadlock between the
parent directory and the child file.

So it's safe to do file clone within the same file when the two
ranges are not overlapped.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:03 -04:00
Liu Bo b7394eb91c Btrfs: remove unused code in btrfs_del_root
'leaf' and 'ri' is not used somehow.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:02 -04:00
Liu Bo 2da1c669f0 Btrfs: kill replicate code in replay_one_buffer
EXTREF is treated same as REF, so we can make the code tidy.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:01 -04:00
Liu Bo 33157e05db Btrfs: check if leaf's parent exists before pushing items around
During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:58 -04:00
Liu Bo fdd99c7294 Btrfs: dont do log_removal in insert_new_root
As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.

As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.

As above, insert_new_root can get rid of log_removal.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:57 -04:00
Wei Yongjun 4b286cd1f5 Btrfs: return error code in btrfs_check_trunc_cache_free_space()
Fix to return error code instead always return 0 from function
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd9224
(Btrfs: don't use global block reservation for inode cache truncation)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:56 -04:00
Josef Bacik 139f807a1e Btrfs: fix estale with btrfs send
This fixes bugzilla 57491.  If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems.  When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup.  So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it.  The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup.  With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:55 -04:00
Anand Jain 183860f6a0 btrfs: device delete to get errors from the kernel
when user runs command btrfs dev del the raid requisite error if any
goes to the /var/log/messages, its not good idea to clutter messages
with these user (knowledge) errors, further user don't have to review
the system messages to know problem with the cli it should be dropped
to the user as part of the cli return.

to bring this feature created a set of the ERROR defined
BTRFS_ERROR_DEV* error codes and created their error string.

I expect this enum to be added with other error which we might
want to communicate to the user land

v3:
moved the code with in the file no logical change

v1->v2:
introduce error codes for the device mgmt usage

v1:
adds a parameter in the ioctl arg struct to carry the error string

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:53 -04:00
Josef Bacik c73e293678 Btrfs: do delay iput in sync_fs
We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:52 -04:00
Miao Xie 4a9d8bdee3 Btrfs: make the state of the transaction more readable
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.

This patch improved the above problem. In this patch, we define 6 states
for the transaction,
  enum btrfs_trans_state {
	TRANS_STATE_RUNNING		= 0,
	TRANS_STATE_BLOCKED		= 1,
	TRANS_STATE_COMMIT_START	= 2,
	TRANS_STATE_COMMIT_DOING	= 3,
	TRANS_STATE_UNBLOCKED		= 4,
	TRANS_STATE_COMPLETED		= 5,
	TRANS_STATE_MAX			= 6,
  }
and just use 1 variant to track those state.

In order to make the blocked handle types for each state more clear,
we introduce a array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]		= 0U,
	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START),
	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH),
	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN),
	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
  }
it is very intuitionistic.

Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:51 -04:00
Miao Xie 581227d0d2 Btrfs: remove the time check in btrfs_commit_transaction()
We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
  operation of the external writers(TRANS_START/TRANS_USERSPACE).
- Except the space that we have to commit transaction, such as
  snapshot creation, btrfs doesn't commit the transaction on its
  own initiative.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:50 -04:00
Miao Xie 3f1e3fa65c Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure
We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.

Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).

So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:48 -04:00
Miao Xie 824366177a Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
It is unnecessary to flush the delalloc inodes again and again because
we don't care the dirty pages which are introduced after the flush, and
they will be flush in the transaction commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:47 -04:00
Miao Xie 0860adfdb2 Btrfs: don't wait for all the writers circularly during the transaction commit
btrfs_commit_transaction has the following loop before we commit the
transaction.

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
	 (should_grow && cur_trans->num_joined != joined));

This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.

In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:46 -04:00
Miao Xie 25d8c284c7 Btrfs: remove the code for the impossible case in cleanup_transaction()
If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:45 -04:00
Miao Xie ac6738792f Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:44 -04:00
Miao Xie 6a03843df4 Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
Before applying this patch, we need flush all the delalloc inodes in
the fs when we want to create a snapshot, it wastes time, and make
the transaction commit be blocked for a long time. It means some other
user operation would also be blocked for a long time.

This patch improves this problem, we just flush the delalloc inodes that
in the source trees before snapshot creation, so the transaction commit
will complete quickly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:42 -04:00
Miao Xie 199c2a9c3d Btrfs: introduce per-subvolume ordered extent list
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:41 -04:00
Miao Xie eb73c1b7ce Btrfs: introduce per-subvolume delalloc inode list
When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:40 -04:00
Miao Xie b0feb9d96e Btrfs: introduce grab/put functions for the root of the fs/file tree
The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:38 -04:00
Miao Xie cb517eabba Btrfs: cleanup the similar code of the fs root read
There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
  btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:37 -04:00
Miao Xie babbf170c7 Btrfs: make the snap/subv deletion end more early when the fs is R/O
The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:36 -04:00
Miao Xie dc7f370c05 Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()
If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:34 -04:00
Miao Xie 05323cd135 Btrfs: make the cleaner complete early when the fs is going to be umounted
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:33 -04:00
Miao Xie d027824564 Btrfs: remove unnecessary ->s_umount in cleaner_kthread()
In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.

We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:32 -04:00
Stefan Behrens 3c64a1aba7 Btrfs: cleanup: don't check the same thing twice
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:30 -04:00
Stefan Behrens b1b195969f Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL
No need to check for NULL in send.c and disk-io.c.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:29 -04:00
Stefan Behrens 78a1068b28 Btrfs: delete unused function
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:28 -04:00
Liu Bo 5798b92d2b Btrfs: remove useless copy in quota_ctl
We don't need to copy it back to user side as it remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:27 -04:00
Andreas Philipp 1c89cdd1ce Minor format cleanup.
Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:25 -04:00
Tsutomu Itoh 924794c936 Btrfs: cleanup unused arguments in send.c
sctx is removed from the argument of the function that
doesn't use sctx.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:24 -04:00
Stefan Behrens 8f69dbd236 Btrfs: fix a comment
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:23 -04:00
Jan Schmidt 57254b6ebc Btrfs: add ioctl to wait for qgroup rescan completion
btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:22 -04:00
Wang Shilong 1e8f915868 Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist
When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.

By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below

fsstress -p 4 -n 10000 -d $dir

With this patch:

real    0m50.153s
user    0m0.081s
sys     0m6.294s

real    0m51.113s
user    0m0.092s
sys     0m6.220s

real    0m52.610s
user    0m0.096s
sys     0m6.125s	avg 6.213
-----------------------------------------------------
Without the patch:

real    0m54.825s
user    0m0.061s
sys     0m10.665s

real    1m6.401s
user    0m0.089s
sys     0m11.218s

real    1m13.768s
user    0m0.087s
sys     0m10.665s       avg 10.849

we can see the sys time reduce ~43%.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:21 -04:00
David Sterba 85965600f5 btrfs: show compiled-in config features at module load time
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:19 -04:00
David Sterba e6d2960582 btrfs: move ifdef around sanity checks out of init_btrfs_fs
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:18 -04:00
David Sterba 905d0f564e btrfs: add prefix to sanity tests messages
And change the message level to KERN_INFO.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:17 -04:00
David Sterba 8d599ae1bf btrfs: add debug check for extent_io range alignment
The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:15 -04:00
Henrik Nordvik 15b0a89d71 Btrfs: fix check on same raid type flag twice
Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.

Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:14 -04:00
Bob Peterson 512cbf02fd GFS2: fix regression in dir_double_exhash
Recent commit e8830d8 introduced a bug in function dir_double_exhash;
it was failing to set h in the fall-back case. This patch corrects it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-14 13:27:17 +01:00
Steven Whitehouse 6d4ade986f GFS2: Add atomic_open support
I've restricted atomic_open to only operate on regular files, although
I still don't understand why atomic_open should not be possible also for
directories on GFS2. That can always be added in later though, if it
makes sense.

The ->atomic_open function can be passed negative dentries, which
in most cases means either ENOENT (->lookup) or a call to d_instantiate
(->create). In the GFS2 case though, we need to actually perform the
look up, since we do not know whether there has been a new inode created
on another node. The look up calls d_splice_alias which then tries to
rehash the dentry - so the solution here is to simply check for that
in d_splice_alias. The same issue is likely to affect any other cluster
filesystem implementing ->atomic_open

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields fieldses org>
Cc: Jeff Layton <jlayton@redhat.com>
2013-06-14 11:17:15 +01:00
Linus Torvalds a2648ebb7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of crash fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: stop all workers before cleaning up roots
  Btrfs: fix use-after-free bug during umount
  Btrfs: init relocate extent_io_tree with a mapping
  btrfs: Drop inode if inode root is NULL
  Btrfs: don't delete fs_roots until after we cleanup the transaction
2013-06-13 22:34:14 -07:00
Jaegeuk Kim 354a3399dc f2fs: recover wrong pino after checkpoint during fsync
If a file is linked, f2fs loose its parent inode number so that fsync calls
for the linked file should do checkpoint all the time.
But, if we can recover its parent inode number after the checkpoint, we can
adjust roll-forward mechanism for the further fsync calls, which is able to
improve the fsync performance significatly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:45 +09:00
Haicheng Li b25958b6ec f2fs: optimize do_write_data_page()
Since "need_inplace_update() == true" is a very rare case, using unlikely()
to give compiler a chance to optimize the code.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:44 +09:00
Haicheng Li 8d8451af68 f2fs: make locate_dirty_segment() as static
It's used only locally and could be static.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:44 +09:00
Haicheng Li e79efe3b69 f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
We can get the value directly from pointer "curseg".

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Jaegeuk Kim b3783873cc f2fs: avoid freqeunt write_inode calls
If update_inode is called, we don't need to do write_inode.
So, let's use a *dirty* flag for each inode.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Namjae Jeon d7cc950b4c f2fs: optimise the truncate_data_blocks_range() range
The function truncate_data_blocks_range() decrements the valid
block count of inode via dec_valid_block_count(). Since this
function updates the i_blocks field of inode, we can update this
field once we have calculated total the number of blocks
to be freed.

Therefore we can decrement valid blocks outside of the for loop.

	if (nr_free) {
+		dec_valid_block_count(sbi, dn->inode, nr_free);
 		set_page_dirty(dn->node_page);
 		sync_inode_page(dn);
 	}

'nr_free' tells the total number of blocks freed. So, we can
just directly pass this value to dec_valid_block_count() and update
the i_blocks.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:42 +09:00
Namjae Jeon 6a3e8ef0de f2fs: use the F2FS specific flags in f2fs_ioctl()
In f2fs_ioctl() function, it is using generic flags.
Since F2FS specific flags are defined. So lets use
those flags.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:26 +09:00
Dave Chinner ade1335afe xfs: ensure btree root split sets blkno correctly
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-13 14:18:02 -05:00
Dave Chinner 8a1fd2950e xfs: fix implicit padding in directory and attr CRC formats
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.

It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:

[  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[  172.316346] #9 calling xfs_dir2_data_log_unused()
[  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.

Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.

And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.

Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-13 10:30:03 -05:00
Jie Liu 72dac95d44 ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12 23:13:59 -04:00
Paul Gortmaker 75497d0607 jbd2: remove debug dependency on debug_fs and update Kconfig help text
Commit b6e96d0067 ("jbd2: use module parameters instead of debugfs
for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
also moved the /sys variables out from underneath the typical debugfs
mount point.  Delete the dependency and update the /sys path to where
the debug settings are currently.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:07:51 -04:00
Paul Gortmaker 169f1a2a87 jbd2: use a single printk for jbd_debug()
Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):

[  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved.  The output should have been:

(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:04:04 -04:00
Paul Gortmaker cfc7bc896f jbd2: fix duplicate debug label for phase 2
Currently we see this output:

  $git grep phase fs/jbd2
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
  [...]

There is clearly a duplicate label for phase 2, and they are
both active (i.e. not in #if ... #else block).  Rename them to
be "2a" and "2b" so the debug output is unambiguous.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:56:35 -04:00
Paul Gortmaker 0ef54180e0 jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space()
While trying to debug an an issue under extreme I/O loading
on preempt-rt kernels, the following backtrace was observed
via SysRQ output:

rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
 ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
 ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
 ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
Call Trace:
 [<ffffffff8172dc34>] schedule+0x24/0x70
 [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
 [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
 [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
 [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
 [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
 [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
 [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
 [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
 [<ffffffff8114370f>] do_rmdir+0xdf/0x120
 [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
 [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
 [<ffffffff817361ae>] ? int_signal+0x12/0x17
 [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
 [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b

What is interesting here, is that we call log_wait_commit, from
within wait_for_space, but we are still holding the checkpoint_mutex
as it surrounds mostly the whole of wait_for_space.  And then, as we
are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
bit is set, then we will also try to take the same checkpoint_mutex.

It seems that we need to drop the checkpoint_mutex while sitting in
jbd2_log_wait_commit, if we want to guarantee that progress can be made
by jbd2_journal_commit_transaction().  There does not seem to be
anything preempt-rt specific about this, other then perhaps increasing
the odds of it happening.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:47:35 -04:00
Paul Gortmaker 3ca841c106 jbd2: relocate assert after state lock in journal_commit_transaction()
The state lock is taken after we are doing an assert on the state
value, not before.  So we might in fact be doing an assert on a
transient value.  Ensure the state check is within the scope of
the state lock being taken.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:46:35 -04:00
Dmitry Monakhov 4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov 06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Dmitry Monakhov 9ff8644624 jbd2: optimize jbd2_journal_force_commit
Current implementation of jbd2_journal_force_commit() is suboptimal because
result in empty and useless commits. But callers just want to force and wait
any unfinished commits. We already have jbd2_journal_force_commit_nested()
which does exactly what we want, except we are guaranteed that we do not hold
journal transaction open.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:24:07 -04:00
Linus Torvalds a568fa1c91 Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "Bunch of fixes and one little addition to math64.h"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  include/linux/math64.h: add div64_ul()
  mm: memcontrol: fix lockless reclaim hierarchy iterator
  frontswap: fix incorrect zeroing and allocation size for frontswap_map
  kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
  mm: migration: add migrate_entry_wait_huge()
  ocfs2: add missing lockres put in dlm_mig_lockres_handler
  mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
  drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
  aio: fix io_destroy() regression by using call_rcu()
  rtc-at91rm9200: use shadow IMR on at91sam9x5
  rtc-at91rm9200: add shadow interrupt mask
  rtc-at91rm9200: refactor interrupt-register handling
  rtc-at91rm9200: add configuration support
  rtc-at91rm9200: add match-table compile guard
  fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
  swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
  drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
  cciss: fix broken mutex usage in ioctl
  audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
  drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
  ...
2013-06-12 16:29:53 -07:00
Xue jiufei 27749f2ff0 ocfs2: add missing lockres put in dlm_mig_lockres_handler
dlm_mig_lockres_handler() is missing a dlm_lockres_put() on an error path.

Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Kent Overstreet 4fcc712f5c aio: fix io_destroy() regression by using call_rcu()
There was a regression introduced by 36f5588905 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.

This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).

Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).

So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Goldwyn Rodrigues e099127169 fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
While removing a non-empty directory, the kernel dumps a message:

  (rmdir,21743,1):ocfs2_unlink:953 ERROR: status = -39

Suppress the error message from being printed in the dmesg so users
don't panic.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Xiaowei.Hu 7869e59067 ocfs2: ocfs2_prep_new_orphaned_file() should return ret
If an error occurs, for example an EIO in __ocfs2_prepare_orphan_dir,
ocfs2_prep_new_orphaned_file will release the inode_ac, then when the
caller of ocfs2_prep_new_orphaned_file gets a 0 return, it will refer to
a NULL ocfs2_alloc_context struct in the following functions.  A kernel
panic happens.

Signed-off-by: "Xiaowei.Hu" <xiaowei.hu@oracle.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Kees Cook 637241a900 kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections.  Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions.  With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.

To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:

 - /proc/kmsg allows:
  - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
    single-reader interface (SYSLOG_ACTION_READ).
  - everything, after an open.

 - syslog syscall allows:
  - anything, if CAP_SYSLOG.
  - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
    dmesg_restrict==0.
  - nothing else (EPERM).

The use-cases were:
 - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
 - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
   destructive SYSLOG_ACTION_READs.

AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.

Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.

To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.

 - /dev/kmsg allows:
  - open if CAP_SYSLOG or dmesg_restrict==0
  - reading/polling, after open

Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:44 -07:00
Theodore Ts'o 981250ca89 ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily
Commit 18888cf0883c: "ext4: speed up truncate/unlink by not using
bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
the most important codepath for file systems using extents, but a
similar optimization also can be done for file systems using indirect
blocks, and for the two special cases in the ext4 extents code.

Cc: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:48:29 -04:00
Theodore Ts'o 2ed5724d5a ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match.  This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number.  So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more.  So it's better to add a
cond_resched() here a well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:43:02 -04:00
Linus Torvalds 8d7a8fe2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a pair of fixes for double-frees in the recent bundle for
  3.10, a couple of fixes for long-standing bugs (sleep while atomic and
  an endianness fix), and a locking fix that can be triggered when osds
  are going down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup in rbd_add()
  rbd: don't destroy ceph_opts in rbd_add()
  ceph: ceph_pagelist_append might sleep while atomic
  ceph: add cpu_to_le32() calls when encoding a reconnect capability
  libceph: must hold mutex for reset_changed_osds()
2013-06-12 08:28:19 -07:00
Jaegeuk Kim 699489bbbe f2fs: sync dir->i_size with its block allocation
If new dentry block is allocated and its i_size is updated, we should update
its inode block together in order to sync i_size and its block allocation.
Otherwise, we can loose additional dentry block due to the unconsistent i_size.

Errorneous Scenario
-------------------

In the recovery routine,
 - recovery_dentry
 | - __f2fs_add_link
 | | - get_new_data_page
 | | | - i_size_write(new_i_size)
 | | | - mark_inode_dirty_sync(dir)
 | | - update_parent_metadata
 | | | - mark_inode_dirty(dir)
 |
 - write_checkpoint
   - sync_dirty_dir_inodes
     - filemap_flush(dentry_blocks)
       - f2fs_write_data_page
         - skip to write the last dentry block due to index < i_size

In the above flow, new_i_size is not updated to its inode block so that the
last dentry block will be lost accordingly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-12 08:18:35 +09:00
Steven Whitehouse 5a00f3cc97 GFS2: Only do one directory search on create
Creation of a new inode requires a directory search in order to ensure
that we are not trying to create an inode with the same name as an
existing one. This was hidden away inside the create_ok() function.

In the case that there was an existing inode, and a lookup can be
substituted for a create (which is the case with regular files
when the O_EXCL flag is not in use) then we were doing a second
lookup in order to return the inode.

This patch merges these two lookups into one. This can be done by
passing a flag to gfs2_dir_search() to tell it to just return -EEXIST
in the cases where we don't actually want to look up the inode.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-11 13:45:29 +01:00
Jaegeuk Kim 2d4d9fb591 f2fs: fix i_blocks translation on various types of files
Basically an inode manages the number of allocated blocks with inode->i_blocks
which is represented in a unit of sectors, not file system blocks.
But, f2fs has used i_blocks in a unit of file system blocks, and f2fs_getattr
translates it to the number of sectors when fstat is called.

However, previously f2fs_file_inode_operations only has this, so this patch adds
it to all the types of inode_operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:09 +09:00
Gu Zheng 5fb08372a6 f2fs: set sb->s_fs_info before calling parse_options()
In f2fs_fill_super(), set sb->s_fs_info before calling parse_options(), then we can get
f2fs_sb_info via F2FS_SB(sb) in parse_options().
So that the second argument "sbi" of func parse_options() is no longer needed.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:08 +09:00
Jaegeuk Kim 8ae8f1627f f2fs: support xattr security labels
This patch adds the support of security labels for f2fs, which will be used
by Linus Security Models (LSMs).

Quote from http://en.wikipedia.org/wiki/Linux_Security_Modules:
"Linux Security Modules (LSM) is a framework that allows the Linux kernel to
support a variety of computer security models while avoiding favoritism toward
any single security implementation. The framework is licensed under the terms of
the GNU General Public License and is standard part of the Linux kernel since
Linux 2.6. AppArmor, SELinux, Smack and TOMOYO Linux are the currently accepted
modules in the official kernel.".

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-11 16:01:03 +09:00
Mikulas Patocka bbd465df73 hpfs: fix warnings when the filesystem fills up
This patch fixes warnings due to missing lock on write error path.

  WARNING: at fs/hpfs/hpfs_fn.h:353 hpfs_truncate+0x75/0x80 [hpfs]()
  Hardware name: empty
  Pid: 26563, comm: dd Tainted: P           O 3.9.4 #12
  Call Trace:
    hpfs_truncate+0x75/0x80 [hpfs]
    hpfs_write_begin+0x84/0x90 [hpfs]
    _hpfs_bmap+0x10/0x10 [hpfs]
    generic_file_buffered_write+0x121/0x2c0
    __generic_file_aio_write+0x1c7/0x3f0
    generic_file_aio_write+0x7c/0x100
    do_sync_write+0x98/0xd0
    hpfs_file_write+0xd/0x50 [hpfs]
    vfs_write+0xa2/0x160
    sys_write+0x51/0xa0
    page_fault+0x22/0x30
    system_call_fastpath+0x1a/0x1f

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@kernel.org  # 2.6.39+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-08 17:39:40 -07:00
Linus Torvalds 81db4dbf59 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:

 - Trivial: unused variable removal

 - Posix-timers: Add the clock ID to the new proc interface to make it
   useful.  The interface is new and should be functional when we reach
   the final 3.10 release.

 - Cure a false positive warning in the tick code introduced by the
   overhaul in 3.10

 - Fix for a persistent clock detection regression introduced in this
   cycle

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Correct run-time detection of persistent_clock.
  ntp: Remove unused variable flags in __hardpps
  posix-timers: Show clock ID in proc file
  tick: Cure broadcast false positive pending bit warning
2013-06-08 15:51:21 -07:00
Bryan Schumaker 6b140b85d9 NFS: Add in v4.2 callback operation
NFS v4.2 adds a CB_OFFLOAD operation used by COPY and WRITE_PLUS.  Since
neither of these operations have been implemented yet, simply return
NFS4ERR_NOTSUPP.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:18 -04:00
Bryan Schumaker 459de2edb9 NFS: Make callbacks minor version generic
I found a few places that hardcode the minor version number rather than
making it dependent on the protocol the callback came in over.  This
patch makes it easier to add new minor versions in the future.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:18 -04:00
Steve Dickson f58eda9bc2 Kconfig: Add Kconfig entry for Labeled NFS V4 client
This patch adds the NFS_V4_SECURITY_LABEL entry which
enables security label support for the NFSv4 client

Signed-off-by: Steve Dickson <steved@redhat.com>
[trond: Make this non-interactive]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:17 -04:00
David Quigley c9bccef6b9 NFS: Extend NFS xattr handlers to accept the security namespace
The existing NFSv4 xattr handlers do not accept xattr calls to the security
namespace. This patch extends these handlers to accept xattrs from the security
namespace in addition to the default NFSv4 ACL namespace.

Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:17 -04:00
David Quigley aa9c266962 NFS: Client implementation of Labeled-NFS
This patch implements the client transport and handling support for labeled
NFS. The patch adds two functions to encode and decode the security label
recommended attribute which makes use of the LSM hooks added earlier. It also
adds code to grab the label from the file attribute structures and encode the
label to be sent back to the server.

Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:16 -04:00
David Quigley 14c43f7678 NFS: Add label lifecycle management
This patch adds the lifecycle management for the security label structure
introduced in an earlier patch. The label is not used yet but allocations and
freeing of the structure is handled.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:15 -04:00
David Quigley 1775fd3e80 NFS:Add labels to client function prototypes
After looking at all of the nfsv4 operations the label structure has been added
to the prototypes of the functions which can transmit label data.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:15 -04:00
David Quigley a09df2ca23 NFSv4: Extend fattr bitmaps to support all 3 words
The fattr handling bitmap code only uses the first two fattr words sofar. This
patch adds the 3rd word to being sent but doesn't populate it yet.

Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:14 -04:00
Steve Dickson e058f70b80 NFSv4: Introduce new label structure
In order to mimic the way that NFSv4 ACLs are implemented we have created a
structure to be used to pass label data up and down the call chain. This patch
adds the new structure and new members to the required NFSv4 call structures.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:14 -04:00
Steve Dickson 42c2c4249c NFSv4.2: Added NFS v4.2 support to the NFS client
This enable NFSv4.2 support. To enable this code the
CONFIG_NFS_V4_2 Kconfig define needs to be set and
the -o v4.2 mount option need to be used.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:13 -04:00
David Quigley 649f6e7718 LSM: Add flags field to security_sb_set_mnt_opts for in kernel mount data.
There is no way to differentiate if a text mount option is passed from user
space or the kernel. A flags field is being added to the
security_sb_set_mnt_opts hook to allow for in kernel security flags to be sent
to the LSM for processing in addition to the text options received from mount.
This patch also updated existing code to fix compilation errors.

Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-08 16:20:12 -04:00
Josef Bacik 13e6c37b98 Btrfs: stop all workers before cleaning up roots
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread.  That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup.  This
should fix the race.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-08 15:11:35 -04:00
Liu Bo 2932505abe Btrfs: fix use-after-free bug during umount
Commit be283b2e67
(    Btrfs: use helper to cleanup tree roots) introduced the following bug,

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
[...]
 Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
 RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
 Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
[...]
 Call Trace:
  [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
  [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
  [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
  [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
  [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
  [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
  [<ffffffff81068d3d>] kthread+0x8d/0x95
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
  [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]

We've free'ed commit_root before actually getting to free block groups where
caching thread needs valid extent_root->commit_root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:10:01 -04:00
Josef Bacik a9995eece3 Btrfs: init relocate extent_io_tree with a mapping
Dave reported a NULL pointer deref.  This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping.  To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Naohiro Aota 6379ef9fb2 btrfs: Drop inode if inode root is NULL
There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Josef Bacik 7b5ff90ed0 Btrfs: don't delete fs_roots until after we cleanup the transaction
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root.  Thanks

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Tyler Hicks 0df5ed65c1 eCryptfs: Make extent and scatterlist crypt function parameters similar
The 'dest' abbreviation is only used in crypt_scatterlist(), while all
other functions in crypto.c use 'dst' so dest_sg should be renamed to
dst_sg.

The crypt_stat parameter is typically the first parameter in internal
eCryptfs functions so crypt_stat and dst_page should be swapped in
crypt_extent().

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:29 -07:00
Tyler Hicks 406c93df09 eCryptfs: Collapse crypt_page_offset() into crypt_extent()
crypt_page_offset() simply initialized the two scatterlists and called
crypt_scatterlist() so it is simple enough to move into the only
function that calls it.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:28 -07:00
Tyler Hicks d78de61896 eCryptfs: Merge ecryptfs_encrypt_extent() and ecryptfs_decrypt_extent()
They are identical except if the src_page or dst_page index is used, so
they can be merged safely if page_index is conditionally assigned.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:27 -07:00
Tyler Hicks a8ca90e207 eCryptfs: Combine page_offset crypto functions
Combine ecryptfs_encrypt_page_offset() and
ecryptfs_decrypt_page_offset(). These two functions are functionally
identical so they can be safely merged if the caller can indicate
whether an encryption or decryption operation should occur.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:27 -07:00
Tyler Hicks 00a699400a eCryptfs: Combine encrypt_scatterlist() and decrypt_scatterlist()
These two functions are identical except for a debug printk and whether
they call crypto_ablkcipher_encrypt() or crypto_ablkcipher_decrypt(), so
they can be safely merged if the caller can indicate if encryption or
decryption should occur.

The debug printk is useless so it is removed.

Two new #define's are created to indicate if an ENCRYPT or DECRYPT
operation is desired.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:26 -07:00
Tyler Hicks 9c6043f412 eCryptfs: Decrypt pages in-place
When reading in a page, eCryptfs would allocate a helper page, fill it
with encrypted data from the lower filesytem, and then decrypt the data
from the encrypted page and store the result in the eCryptfs page cache
page.

The crypto API supports in-place crypto operations which means that the
allocation of the helper page is unnecessary when decrypting. This patch
gets rid of the unneeded page allocation by reading encrypted data from
the lower filesystem directly into the page cache page. The page cache
page is then decrypted in-place.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:25 -07:00
Tyler Hicks 28916d1ac1 eCryptfs: Accept one offset parameter in page offset crypto functions
There is no longer a need to accept different offset values for the
source and destination pages when encrypting/decrypting an extent in an
eCryptfs page. The two offsets can be collapsed into a single parameter.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:24 -07:00
Tyler Hicks 24d15266bd eCryptfs: Simplify lower file offset calculation
Now that lower filesystem IO operations occur for complete
PAGE_CACHE_SIZE bytes, the calculation for converting an eCryptfs extent
index into a lower file offset can be simplified.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:23 -07:00
Tyler Hicks 0f89617623 eCryptfs: Read/write entire page during page IO
When reading and writing encrypted pages, perform IO using the entire
page all at once rather than 4096 bytes at a time.

This only affects architectures where PAGE_CACHE_SIZE is larger than
4096 bytes.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:22 -07:00
Tyler Hicks 12003e5b18 eCryptfs: Use entire helper page during page crypto operations
When encrypting eCryptfs pages and decrypting pages from the lower
filesystem, utilize the entire helper page rather than only the first
4096 bytes.

This only affects architectures where PAGE_CACHE_SIZE is larger than
4096 bytes.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:28:21 -07:00
Thomas Meyer fc8b14d338 eCryptfs: Cocci spatch "memdup.spatch"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2013-06-07 17:26:55 -07:00
Linus Torvalds b8e9dbacdd * Fixes how eCryptfs handles msync to sync both the upper and lower file
* A couple of MAINTAINERS updates
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCgAGBQJRsU9rAAoJENaSAD2qAscK3ccP/ia/C8LawYZbgbZq8VVGY0li
 LzBbO+OT2O47h7tmILOM2fgR86emt1FkKa+snWBGWiigGfeTyQX6g2/I9adEEY3D
 H8n5DBc6d6TlBO2ca1dXUkIe/b2b3Mj4NwxZIXDLwd15nJDwW5aQCFMtlrJ0UmBp
 q2Xuvj/AprXTz02jk7V7vcor5wd8VgveTnLTIclLQ1AwJ+Zt51nx5UJo4jqlQydF
 1MO+tvKRo4LWs+dl1j1J4VkZIfy3frXO8m2W9wc+pJRE6FO1jyDrbjjyaP5BfelU
 tfpKkx801tnp//1NTjOcakhifH/kCm43WQn4eV7qb386pV85nGEZrP1MdRoEVmv3
 39NlKLi9sUvLZ2DgAl1qsdh36VD9PF1z1uKlcRjABJb/5IOM2RfvM2FltfqLtMM0
 lenZOZb3r62CiKIoqz6CgQnU18XE3ICo/mBiuWm1FS2l/IgwKfU4dUV361KH4ykb
 KzHKFVDY68SdPHkYJtEGJ86pgLA7N6Lu97c8+zesnGzYK1HLf5I7MvP9toM0zUXX
 TEv68sbRGEOia7EGcyE9gSVZebhtneosi4/z73GJa93kYyc+6AleLL3y+X7MthwT
 WbvbzmrsU1JeRO/VM99xRYDspDM3kxnTQlhyGOw+GunUY6Zk7nl2IiNcNTPfGP3w
 QJhg09DfITcwOmf2ImFb
 =NgKp
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.10-rc5-msync' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull ecryptfs fixes from Tyler Hicks:
 - Fixes how eCryptfs handles msync to sync both the upper and lower
   file
 - A couple of MAINTAINERS updates

* tag 'ecryptfs-3.10-rc5-msync' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: Check return of filemap_write_and_wait during fsync
  Update eCryptFS maintainers
  ecryptfs: fixed msync to flush data
2013-06-07 16:21:44 -07:00
Nick Dyer fc60bb8339 sysfs_notify is only possible on file attributes
If sysfs_notify is called on a binary attribute, bad things can
happen, so prevent it.

Note, no in-kernel usage of this is currently present, but in the
future, it's good to be safe.

Changes in V2:
- Also ignore sysfs_notify on dirs, links
- Use WARN_ON rather than silently failing
- Compiled and tested (huge apologies about first submission)

Signed-off-by: Nick Dyer <nick.dyer@itdev.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-07 16:05:50 -07:00
Linus Torvalds e432785934 Merge branch 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fix from Steve French:
 "Fix one byte buffer overrun with prefixpaths on cifs mounts which can
  cause a problem with mount depending on the string length"

* 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix off-by-one bug in build_unc_path_to_root
2013-06-07 16:05:43 -07:00
Dave Chiluk 698b822363 ncpfs: fix rmdir returns Device or resource busy
1d2ef59014 caused a regression in ncpfs such that
directories could no longer be removed.  This was because ncp_rmdir checked
to see if a dentry could be unhashed before allowing it to be removed. Since
1d2ef59014 introduced a change that incremented
dentry->d_count causing it to always be greater than 1 unhash would always
fail.  Thus causing the error path in ncp_rmdir to always be taken.  Removing
this error path is safe as unhashing is still accomplished by calls to dput
from vfs_rmdir.

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-07 12:15:38 -04:00
Jaegeuk Kim 5deb82671a f2fs: fix iget/iput of dir during recovery
It is possible that iput is skipped after iget during the recovery.

In recover_dentry(),
 dir = f2fs_iget();
 ...
 if (de && inode->i_ino == le32_to_cpu(de->ino))
	goto out;

In this case, this dir is not able to be added in dirty_dir_inode_list.
The actual linking is done only when set_page_dirty() is called.

So let's add this newly got inode into the list explicitly, and put it at the
end of the recovery routine.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-07 13:21:37 +09:00
Linus Torvalds e6395b68ad xfs: update for 3.10-rc5
- Rework of dquot CRCs
 - Fix for remote attribute invalidation of a leaf
 - Fix ordering of transaction replay in recovery
 - Implement CRCs for inode unlinked list
 - Disable noattr2/attr2 mount options when CRCs are enabled
 - Bump the limitation of ACL entries for v5 superblocks
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRsLfDAAoJENaLyazVq6ZOgBQP/jKupB/cOV3sewBsPDDPBR46
 xg3qaps6zpEWtXGWnXe8HF/u57YfoA5K+YVwq6+jkIsYFjP3dDdLPDeEeC3HoB9I
 VZPmV5VEACvUyD9WhMeSjAbRPAtweFFbTuZZqULv2SpG+tUaF8VUz7luUM4XpcFa
 NtxccORMmBBN1j71Qod4+xxJ1BM/KtXV1RBMudtPAWr+//LKwLm/9HFavw2uXeQW
 xgebmc95DXrFpjwXepHQXW/xTmVPclah034JC8kj+Q/VhmWvVIgo331gVL0M1+L9
 U7OObUdJAJ+5VN872TgwbzWMSRCPKZ9PTh78LZksYtweb5GjD/x6yVXFSWAix/O7
 q1EkYUfBR1YHZVTAfoE2QGCTkgJCsellBJWzIoov2/Qqq18QRA1EtznjtZ+WkqP6
 dKzgcnDb7LEPjg6Y8L4ZKGhylXGkPmCScTprjgPIWoAUS2ytURkaG+CCK5sp/Ldn
 KRu0zjbrcQrxAVkMo1E2hkOOZDD1qazJE5mfhOnQxLKsPmryM5tarzWmd9O+VjH8
 tiJosS3JGoz1rLOGKYjlqEr9G0zc/3Bmaz7tFnaHeCWTKUHxo7AXwnEfPmKle29h
 lbJFW86DU56QlNb/mLEE+v2ojC/PDpcYHddxG9Yo3B5CrDX8xJgXQ6ML0g86ceL3
 tuyMnOF8opR72Wavc0co
 =R+6z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc5' of git://oss.sgi.com/xfs/xfs

Pull more xfs updates from Ben Myers:
 "Here are several fixes for filesystems with CRC support turned on:
  fixes for quota, remote attributes, and recovery.  There is also some
  feature work related to CRCs: the implementation of CRCs for the inode
  unlinked lists, disabling noattr2/attr2 options when appropriate, and
  bumping the maximum number of ACLs.

  I would have preferred to defer this last category of items to 3.11.
  This would require setting a feature bit for the on-disk changes, so
  there is some pressure to get these in 3.10.  I believe this
  represents the end of the CRC related queue.

   - Rework of dquot CRCs
   - Fix for remote attribute invalidation of a leaf
   - Fix ordering of transaction replay in recovery
   - Implement CRCs for inode unlinked list
   - Disable noattr2/attr2 mount options when CRCs are enabled
   - Bump the limitation of ACL entries for v5 superblocks"

* tag 'for-linus-v3.10-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: increase number of ACL entries for V5 superblocks
  xfs: disable noattr2/attr2 mount options for CRC enabled filesystems
  xfs: inode unlinked list needs to recalculate the inode CRC
  xfs: fix log recovery transaction item reordering
  xfs: fix remote attribute invalidation for a leaf
  xfs: rework dquot CRCs
2013-06-06 16:15:25 -07:00
Trond Myklebust c45ffdd269 NFSv4: Close another NFSv4 recovery race
State recovery currently relies on being able to find a valid
nfs_open_context in the inode->open_files list.
We therefore need to put the nfs_open_context on the list while
we're still protected by the sp->so_reclaim_seqcount in order
to avoid reboot races.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:44 -04:00
Trond Myklebust 275bb30786 NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:43 -04:00
Trond Myklebust 3efb972247 NFSv4: Refactor _nfs4_open_and_get_state to set ctx->state
Instead of having the callers set ctx->state, do it inside
_nfs4_open_and_get_state.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:42 -04:00
Trond Myklebust 4197a055eb NFSv4: Cleanup: pass the nfs_open_context to nfs4_do_open
All the callers have an open_context at this point, and since we always
need one in order to do state recovery, it makes sense to use it as the
basis for the nfs4_do_open() call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:42 -04:00
Trond Myklebust 1a1a29fa84 NFSv4: Remove redundant check for FMODE_EXEC in nfs_finish_open
We already check the EXEC access mode in the lower layers.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:41 -04:00
Trond Myklebust 5cc2216db8 NFSv4.1: Simplify setting the layout header credential
ctx->cred == ctx->state->owner->so_cred, so let's just use the former.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:38 -04:00
Trond Myklebust 4f0b429df1 NFSv4.1: Enable state protection
Use the EXCHGID4_FLAG_BIND_PRINC_STATEID exchange_id flag to enable
stateid protection. This means that if we create a stateid using a
particular principal, then we must use the same principal if we
want to change that state.
IOW: if we OPEN a file using a particular credential, then we have
to use the same credential in subsequent OPEN_DOWNGRADE, CLOSE,
or DELEGRETURN operations that use that stateid.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:37 -04:00
Trond Myklebust cd5875fefe NFSv4.1: Use layout credentials for get_deviceinfo calls
This is not strictly needed, since get_deviceinfo is not allowed to
return NFS4ERR_ACCESS or NFS4ERR_WRONG_CRED, but lets do it anyway
for consistency with other pNFS operations.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:37 -04:00
Trond Myklebust ab7cb0dfab NFSv4.1: Ensure that test_stateid and free_stateid use correct credentials
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:36 -04:00
Trond Myklebust 965e9c23de NFSv4.1: Ensure that reclaim_complete uses the right credential
We want to use the same credential for reclaim_complete as we used
for the exchange_id call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:35 -04:00
Trond Myklebust 9556000d8c NFSv4.1: Ensure that layoutreturn uses the correct credential
We need to use the same credential as was used for the layoutget
and/or layoutcommit operations.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:35 -04:00
Trond Myklebust 6ab59344d9 NFSv4.1: Ensure that layoutget is called using the layout credential
Ensure that we use the same credential for layoutget, layoutcommit and
layoutreturn.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-06-06 16:24:34 -04:00
Theodore Ts'o 20970ba65d ext4: use ext4_da_writepages() for all modes
Rename ext4_da_writepages() to ext4_writepages() and use it for all
modes.  We still need to iterate over all the pages in the case of
data=journalling, but in the case of nodelalloc/data=ordered (which is
what file systems mounted using ext3 backwards compatibility will use)
this will allow us to use a much more efficient I/O submission path.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 14:00:46 -04:00
Dave Chinner 0a8aa19397 xfs: increase number of ACL entries for V5 superblocks
The limit of 25 ACL entries is arbitrary, but baked into the on-disk
format.  For version 5 superblocks, increase it to the maximum nuber
of ACLs that can fit into a single xattr.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinuguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 5c87d4bc1a)
2013-06-06 10:52:15 -05:00
Dave Chinner f763fd440e xfs: disable noattr2/attr2 mount options for CRC enabled filesystems
attr2 format is always enabled for v5 superblock filesystems, so the
mount options to enable or disable it need to be cause mount errors.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit d3eaace84e)
2013-06-06 10:51:34 -05:00
Dave Chinner ad868afddb xfs: inode unlinked list needs to recalculate the inode CRC
The inode unlinked list manipulations operate directly on the inode
buffer, and so bypass the inode CRC calculation mechanisms. Hence an
inode on the unlinked list has an invalid CRC. Fix this by
recalculating the CRC whenever we modify an unlinked list pointer in
an inode, ncluding during log recovery. This is trivial to do and
results in  unlinked list operations always leaving a consistent
inode in the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 0a32c26e72)
2013-06-06 10:51:19 -05:00
Dave Chinner 7540617075 xfs: fix log recovery transaction item reordering
There are several constraints that inode allocation and unlink
logging impose on log recovery. These all stem from the fact that
inode alloc/unlink are logged in buffers, but all other inode
changes are logged in inode items. Hence there are ordering
constraints that recovery must follow to ensure the correct result
occurs.

As it turns out, this ordering has been working mostly by chance
than good management. The existing code moves all buffers except
cancelled buffers to the head of the list, and everything else to
the tail of the list. The problem with this is that is interleaves
inode items with the buffer cancellation items, and hence whether
the inode item in an cancelled buffer gets replayed is essentially
left to chance.

Further, this ordering causes problems for log recovery when inode
CRCs are enabled. It typically replays the inode unlink buffer long before
it replays the inode core changes, and so the CRC recorded in an
unlink buffer is going to be invalid and hence any attempt to
validate the inode in the buffer is going to fail. Hence we really
need to enforce the ordering that the inode alloc/unlink code has
expected log recovery to have since inode chunk de-allocation was
introduced back in 2003...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit a775ad7780)
2013-06-06 10:51:07 -05:00
Dave Chinner ea929536a4 xfs: fix remote attribute invalidation for a leaf
When invalidating an attribute leaf block block, there might be
remote attributes that it points to. With the recent rework of the
remote attribute format, we have to make sure we calculate the
length of the attribute correctly. We aren't doing that in
xfs_attr3_leaf_inactive(), so fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinuguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 59913f14df)
2013-06-06 10:50:52 -05:00
Dave Chinner bb9b8e86ad xfs: rework dquot CRCs
Calculating dquot CRCs when the backing buffer is written back just
doesn't work reliably. There are several places which manipulate
dquots directly in the buffers, and they don't calculate CRCs
appropriately, nor do they always set the buffer up to calculate
CRCs appropriately.

Firstly, if we log a dquot buffer (e.g. during allocation) it gets
logged without valid CRC, and so on recovery we end up with a dquot
that is not valid.

Secondly, if we recover/repair a dquot, we don't have a verifier
attached to the buffer and hence CRCs are not calculated on the way
down to disk.

Thirdly, calculating the CRC after we've changed the contents means
that if we re-read the dquot from the buffer, we cannot verify the
contents of the dquot are valid, as the CRC is invalid.

So, to avoid all the dquot CRC errors that are being detected by the
read verifier, change to using the same model as for inodes. That
is, dquot CRCs are calculated and written to the backing buffer at
the time the dquot is flushed to the backing buffer. If we modify
the dquot directly in the backing buffer, calculate the CRC
immediately after the modification is complete. Hence the dquot in
the on-disk buffer should always have a valid CRC.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 6fcdc59de2)
2013-06-06 10:50:35 -05:00
Theodore Ts'o f4afb4f4e3 ext4: optimize test_root()
The test_root() function could potentially loop forever due to
overflow issues.  So rewrite test_root() to avoid this issue; as a
bonus, it is 38% faster when benchmarked via a test loop:

int main(int argc, char **argv)
{
	int  i;

	for (i = 0; i < 1 << 24; i++) {
		if (test_root(i, 7))
			printf("%d\n", i);
	}
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:40:37 -04:00
Theodore Ts'o 2f2e09eb15 ext4: add sanity check to ext4_get_group_info()
The group number passed to ext4_get_group_info() should be valid, but
let's add an assert to check this before we start creating a pointer
based on that group number and dereferencing it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:16:43 -04:00
Theodore Ts'o b302ef2d3c ext4: verify group number in verify_group_input() before using it
Check the group number for sanity earilier, before calling routines
such as ext4_bg_has_super() or ext4_group_overhead_blocks().

Reported-by: Jonathan Salwan <jonathan.salwan@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 11:14:31 -04:00
Theodore Ts'o a1d8d9a757 ext4: add check to io_submit_init_bio
The bio_alloc() function can return NULL if the memory allocation
fails.  So we need to check for this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-06 10:18:22 -04:00
Alexey Khoroshilov a9aefd707c GFS2: fix error propagation in init_threads()
If kthread_run() fails, init_threads() returns
IS_ERR(p) instead of PTR_ERR(p).

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-06 09:52:29 +01:00
Namjae Jeon b2b3460a94 f2fs: reorganise the function get_victim_by_default
Fix the function get_victim_by_default, where it checks
for the condition  that p.min_segno != NULL_SEGNO as
shown:

if (p.min_segno != NULL_SEGNO)
           goto got_it;

and if above condition is true then

got_it:
        if (p.min_segno != NULL_SEGNO) {

So this condition is being checked twice. Hence move the goto
statement after the if condition so that duplication of condition
check is avoided.

Also this function makes a call to get_max_cost() to compute
the max cost based on the f2fs_sbi_info and victim policy. Since
get_max_cost depends on on three parameters of victim_sel_policy
=> alloc_mode, gc_mode & ofs_unit, once this victim policy is
initialised, these value will not change till the execution
time of get_victim_by_default() & also f2fs_sbi_info structure
parameters will not change.

Hence making calls to get_max_cost() in while loop does not seems to
be a good point. Instead we can call it once in begining and store
the results in local variable, which later can serve our purpose
for comparing the cost with max cost inside the while loop.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-06 14:20:46 +09:00
Joe Perches eb8630d7d2 jfs: Update jfs_error
Use a more current logging style.

Add __printf format and argument verification.

Remove embedded function names from formats.
Add %pf, __builtin_return_address(0) to jfs_error.
Add newlines to formats for kernel style consistency.
(One format already had an erroneous newline)
Coalesce formats and align arguments.

Object size reduced ~1KiB.

$ size fs/jfs/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 201891	  35488	  63936	 301315	  49903	fs/jfs/built-in.o.new
 202821	  35488	  64192	 302501	  49da5	fs/jfs/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-06-05 14:47:19 -05:00
Dave Kleikamp 21d1101f01 jfs: fix sparse warning in fs/jfs/xattr.c
CHECK   fs/jfs/xattr.c
fs/jfs/xattr.c:1092:5: warning: symbol 'jfs_initxattrs' was not declared. Should it be static?

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-06-05 14:47:16 -05:00
Dave Chinner 5c87d4bc1a xfs: increase number of ACL entries for V5 superblocks
The limit of 25 ACL entries is arbitrary, but baked into the on-disk
format.  For version 5 superblocks, increase it to the maximum nuber
of ACLs that can fit into a single xattr.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinuguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05 11:26:53 -05:00
Dave Chinner d3eaace84e xfs: disable noattr2/attr2 mount options for CRC enabled filesystems
attr2 format is always enabled for v5 superblock filesystems, so the
mount options to enable or disable it need to be cause mount errors.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05 11:21:06 -05:00
Dave Chinner 0a32c26e72 xfs: inode unlinked list needs to recalculate the inode CRC
The inode unlinked list manipulations operate directly on the inode
buffer, and so bypass the inode CRC calculation mechanisms. Hence an
inode on the unlinked list has an invalid CRC. Fix this by
recalculating the CRC whenever we modify an unlinked list pointer in
an inode, ncluding during log recovery. This is trivial to do and
results in  unlinked list operations always leaving a consistent
inode in the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05 11:19:10 -05:00
Dave Chinner a775ad7780 xfs: fix log recovery transaction item reordering
There are several constraints that inode allocation and unlink
logging impose on log recovery. These all stem from the fact that
inode alloc/unlink are logged in buffers, but all other inode
changes are logged in inode items. Hence there are ordering
constraints that recovery must follow to ensure the correct result
occurs.

As it turns out, this ordering has been working mostly by chance
than good management. The existing code moves all buffers except
cancelled buffers to the head of the list, and everything else to
the tail of the list. The problem with this is that is interleaves
inode items with the buffer cancellation items, and hence whether
the inode item in an cancelled buffer gets replayed is essentially
left to chance.

Further, this ordering causes problems for log recovery when inode
CRCs are enabled. It typically replays the inode unlink buffer long before
it replays the inode core changes, and so the CRC recorded in an
unlink buffer is going to be invalid and hence any attempt to
validate the inode in the buffer is going to fail. Hence we really
need to enforce the ordering that the inode alloc/unlink code has
expected log recovery to have since inode chunk de-allocation was
introduced back in 2003...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-05 11:13:19 -05:00
Steven Whitehouse edd2e9acc0 GFS2: Remove no-op wrapper function
This wrapper function is no longer required, so get rid of it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05 09:51:23 +01:00
Thomas Meyer 2b6f8860e1 GFS2: Cocci spatch "ptr_ret.spatch"
Use PTR_RET in place of open coding this function.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05 09:51:08 +01:00
Bob Peterson cd51e61eac GFS2: Eliminate gfs2_rg_lops
With recent changes to the transactions, it appears that we
are no longer using the "log ops" for resource groups. Since the
log commit code processes the array of log ops, eliminating this
should be marginally better for performance. Therefore this patch
eliminates it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05 09:50:40 +01:00
Benjamin Marzinski 7f63257da1 GFS2: Sort buffer lists by inplace block number
This patch simply sort the data and metadata buffer lists by their
inplace block number.  This makes gfs2_log_flush issue the inplace IO
in sequential order, which will hopefully speed up writing the IO
out to disk.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-05 09:50:20 +01:00
Tyler Hicks bc5abcf7e4 eCryptfs: Check return of filemap_write_and_wait during fsync
Error out of ecryptfs_fsync() if filemap_write_and_wait() fails.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Paul Taysom <taysom@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: stable@vger.kernel.org # v3.6+
2013-06-04 23:53:31 -07:00
Linus Torvalds 6f66f9005b Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull gfs2 fixes from Steven Whitehouse:
 "There are four patches this time.

  The first fixes a problem where the wrong descriptor type was being
  written into the log for journaled data blocks.

  The second fixes a race relating to the deallocation of allocator
  data.

  The third provides a fallback if kmalloc is unable to satisfy a
  request to allocate a directory hash table.

  The fourth fixes the iopen glock caching so that inodes are deleted in
  a more timely manner after rmdir/unlink"

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Don't cache iopen glocks
  GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables
  GFS2: Increase i_writecount during gfs2_setattr_size
  GFS2: Set log descriptor type for jdata blocks
2013-06-05 09:06:28 +09:00
Linus Torvalds 8764d86100 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "One patch fixes an Oops introduced in 3.9 with the readdirplus
  feature.  The rest are fixes for async-dio in 3.10"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix alignment in short read optimization for async_dio
  fuse: return -EIOCBQUEUED from fuse_direct_IO() for all async requests
  fuse: fix readdirplus Oops in fuse_dentry_revalidate
  fuse: update inode size and invalidate attributes on fallocate
  fuse: truncate pagecache range on hole punch
  fuse: allocate for_background dio requests based on io->async state
2013-06-05 09:03:31 +09:00
Dave Chinner 59913f14df xfs: fix remote attribute invalidation for a leaf
When invalidating an attribute leaf block block, there might be
remote attributes that it points to. With the recent rework of the
remote attribute format, we have to make sure we calculate the
length of the attribute correctly. We aren't doing that in
xfs_attr3_leaf_inactive(), so fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinuguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-04 17:36:30 -05:00
Dave Chinner 6fcdc59de2 xfs: rework dquot CRCs
Calculating dquot CRCs when the backing buffer is written back just
doesn't work reliably. There are several places which manipulate
dquots directly in the buffers, and they don't calculate CRCs
appropriately, nor do they always set the buffer up to calculate
CRCs appropriately.

Firstly, if we log a dquot buffer (e.g. during allocation) it gets
logged without valid CRC, and so on recovery we end up with a dquot
that is not valid.

Secondly, if we recover/repair a dquot, we don't have a verifier
attached to the buffer and hence CRCs are not calculated on the way
down to disk.

Thirdly, calculating the CRC after we've changed the contents means
that if we re-read the dquot from the buffer, we cannot verify the
contents of the dquot are valid, as the CRC is invalid.

So, to avoid all the dquot CRC errors that are being detected by the
read verifier, change to using the same model as for inodes. That
is, dquot CRCs are calculated and written to the backing buffer at
the time the dquot is flushed to the backing buffer. If we modify
the dquot directly in the backing buffer, calculate the CRC
immediately after the modification is complete. Hence the dquot in
the on-disk buffer should always have a valid CRC.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-04 17:35:51 -05:00
Jan Kara 5dc23bdd5f ext4: remove ext4_ioend_wait()
Now that we clear PageWriteback after extent conversion, there's no
need to wait for io_end processing in ext4_evict_inode().  Running
AIO/DIO keeps file reference until aio_complete() is called so
ext4_evict_inode() cannot be called.  For io_end structures resulting
from buffered IO waiting is happening because we wait for
PageWriteback in truncate_inode_pages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:46:12 -04:00
Jan Kara c724585b62 ext4: don't wait for extent conversion in ext4_punch_hole()
We don't have to wait for extent conversion in ext4_punch_hole() as
buffered IO for the punched range has been flushed and waited upon
(thus all extent conversions for that range have completed).  Also we
wait for all DIO to finish using inode_dio_wait() so there cannot be
any extent conversions pending due to direct IO.

Also remove ext4_flush_unwritten_io() since it's unused now.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:44:36 -04:00
Jan Kara 38b8ff7db4 ext4: Remove wait for unwritten extents in ext4_ind_direct_IO()
We don't have to wait for unwritten extent conversion in
ext4_ind_direct_IO() as all writes that happened before DIO are
flushed by the generic code and extent conversion has happened before
we cleared PageWriteback bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:41:29 -04:00
Jan Kara 92e6222dfb ext4: remove i_mutex from ext4_file_sync()
After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
doesn't need i_mutex anymore. Forcing of transaction commits doesn't
need i_mutex as there's nothing inode specific in that code apart from
grabbing transaction ids from the inode. So remove the lock.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:39 -04:00
Jan Kara 37b10dd063 ext4: use generic_file_fsync() in ext4_file_fsync() in nojournal mode
Just use the generic function instead of duplicating it.  We only need
to reshuffle the read-only check a bit (which is there to prevent
writing to a filesystem which has been remounted read-only after error
I assume).

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:40:09 -04:00
Jan Kara a115f749c1 ext4: remove wait for unwritten extent conversion from ext4_truncate()
Since PageWriteback bit is now cleared after extents are converted
from unwritten to written ones, we have full exclusion of writeback
path from truncate (truncate_inode_pages() waits for PageWriteback
bits to get cleared on all invalidated pages).  Exclusion from DIO
path is achieved by inode_dio_wait() call in ext4_setattr().  So
there's no need to wait for extent convertion in ext4_truncate()
anymore.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:30:00 -04:00
Jan Kara e83403959f ext4: protect extent conversion after DIO with i_dio_count
Make sure extent conversion after DIO happens while i_dio_count is
still elevated so that inode_dio_wait() waits until extent conversion
is done.  This removes the need for explicit waiting for extent
conversion in some cases.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:27:38 -04:00
Jan Kara b0857d309f ext4: defer clearing of PageWriteback after extent conversion
Currently PageWriteback bit gets cleared from put_io_page() called
from ext4_end_bio().  This is somewhat inconvenient as extent tree is
not fully updated at that time (unwritten extents are not marked as
written) so we cannot read the data back yet.  This design was
dictated by lock ordering as we cannot start a transaction while
PageWriteback bit is set (we could easily deadlock with
ext4_da_writepages()).  But now that we use transaction reservation
for extent conversion, locking issues are solved and we can move
PageWriteback bit clearing after extent conversion is done.  As a
result we can remove wait for unwritten extent conversion from
ext4_sync_file() because it already implicitely happens through
wait_on_page_writeback().

We implement deferring of PageWriteback clearing by queueing completed
bios to appropriate io_end and processing all the pages when io_end is
going to be freed instead of at the moment ext4_io_end() is called.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:23:41 -04:00
Jan Kara 2e8fa54e3b ext4: split extent conversion lists to reserved & unreserved parts
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did).  So split lists, work items, and work queues to
reserved and unreserved parts.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:21:02 -04:00
Jan Kara 6b523df4fb ext4: use transaction reservation for extent conversion in ext4_end_io
Later we would like to clear PageWriteback bit only after extent
conversion from unwritten to written extents is performed.  However it
is not possible to start a transaction after PageWriteback is set
because that violates lock ordering (and is easy to deadlock).  So we
have to reserve a transaction before locking pages and sending them
for IO and later we use the transaction for extent conversion from
ext4_end_io().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:21:11 -04:00
Jan Kara 3613d22807 ext4: remove buffer_uninit handling
There isn't any need for setting BH_Uninit on buffers anymore.  It was
only used to signal we need to mark io_end as needing extent
conversion in add_bh_to_extent() but now we can mark the io_end
directly when mapping extent.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:19:34 -04:00
Jan Kara 4e7ea81db5 ext4: restructure writeback path
There are two issues with current writeback path in ext4.  For one we
don't necessarily map complete pages when blocksize < pagesize and
thus needn't do any writeback in one iteration.  We always map some
blocks though so we will eventually finish mapping the page.  Just if
writeback races with other operations on the file, forward progress is
not really guaranteed. The second problem is that current code
structure makes it hard to associate all the bios to some range of
pages with one io_end structure so that unwritten extents can be
converted after all the bios are finished.  This will be especially
difficult later when io_end will be associated with reserved
transaction handle.

We restructure the writeback path to a relatively simple loop which
first prepares extent of pages, then maps one or more extents so that
no page is partially mapped, and once page is fully mapped it is
submitted for IO. We keep all the mapping and IO submission
information in mpage_da_data structure to somewhat reduce stack usage.
Resulting code is somewhat shorter than the old one and hopefully also
easier to read.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:17:40 -04:00
Jan Kara fffb273997 ext4: better estimate credits needed for ext4_da_writepages()
We limit the number of blocks written in a single loop of
ext4_da_writepages() to 64 when inode uses indirect blocks.  That is
unnecessary as credit estimates for mapping logically continguous run
of blocks is rather low even for inode with indirect blocks.  So just
lift this limitation and properly calculate the number of necessary
credits.

This better credit estimate will also later allow us to always write
at least a single page in one iteration.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 13:01:11 -04:00
Jan Kara fa55a0ed03 ext4: improve writepage credit estimate for files with indirect blocks
ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
blocks mapped are logically contiguous. That is wrong since the argument
informs whether the blocks are physically contiguous. As the blocks
mapped are always logically contiguous and that's all
ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:56:55 -04:00
Jan Kara f2d50a65c9 ext4: deprecate max_writeback_mb_bump sysfs attribute
This attribute is now unused so deprecate it.  We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:51:16 -04:00
Jan Kara 39bba40b7a ext4: stop messing with nr_to_write in ext4_da_writepages()
Writeback code got better in how it submits IO and now the number of
pages requested to be written is usually higher than original 1024.
The number is now dynamically computed based on observed throughput
and is set to be about 0.5 s worth of writeback.  E.g. on ordinary
SATA drive this ends up somewhere around 10000 as my testing shows.
So remove the unnecessary smarts from ext4_da_writepages().

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:50:24 -04:00
Jan Kara 5fe2fe895a ext4: provide wrappers for transaction reservation calls
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:37:50 -04:00
Jan Kara 8f7d89f368 jbd2: transaction reservation support
In some cases we cannot start a transaction because of locking
constraints and passing started transaction into those places is not
handy either because we could block transaction commit for too long.
Transaction reservation is designed to solve these issues.  It
reserves a handle with given number of credits in the journal and the
handle can be later attached to the running transaction without
blocking on commit or checkpointing.  Reserved handles do not block
transaction commit in any way, they only reduce maximum size of the
running transaction (because we have to always be prepared to
accomodate request for attaching reserved handle).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:35:11 -04:00
Jan Kara f29fad7210 jbd2: remove unused waitqueues
j_wait_logspace and j_wait_checkpoint are unused.  Remove them.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:24:11 -04:00
Jan Kara fe1e8db598 jbd2: fix race in t_outstanding_credits update in jbd2_journal_extend()
jbd2_journal_extend() first checked whether transaction can accept
extending handle with more credits and then added credits to
t_outstanding_credits.  This can race with start_this_handle() adding
another handle to a transaction and thus overbooking a transaction.
Make jbd2_journal_extend() use atomic_add_return() to close the race.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:22:15 -04:00
Jan Kara 76c3990456 jbd2: cleanup needed free block estimates when starting a transaction
__jbd2_log_space_left() and jbd_space_needed() were kind of odd.
jbd_space_needed() accounted also credits needed for currently
committing transaction while it didn't account for credits needed for
control blocks.  __jbd2_log_space_left() then accounted for control
blocks as a fraction of free space.  Since results of these two
functions are always only compared against each other, this works
correct but is somewhat strange.  Move the estimates so that
jbd_space_needed() returns number of blocks needed for a transaction
including control blocks and __jbd2_log_space_left() returns free
space in the journal (with the committing transaction already
subtracted).  Rename functions to jbd2_log_space_left() and
jbd2_space_needed() while we are changing them.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:12:57 -04:00
Jan Kara 2f387f849b jbd2: remove outdated comment
The comment about credit estimates isn't true anymore. We do what the
comment describes now.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:10:11 -04:00
Jan Kara b34090e5e2 jbd2: refine waiting for shadow buffers
Currently when we add a buffer to a transaction, we wait until the
buffer is removed from BJ_Shadow list (so that we prevent any changes
to the buffer that is just written to the journal).  This can take
unnecessarily long as a lot happens between the time the buffer is
submitted to the journal and the time when we remove the buffer from
BJ_Shadow list.  (e.g.  We wait for all data buffers in the
transaction, we issue a cache flush, etc.)  Also this creates a
dependency of do_get_write_access() on transaction commit (namely
waiting for data IO to complete) which we want to avoid when
implementing transaction reservation.

So we modify commit code to set new BH_Shadow flag when temporary
shadowing buffer is created and we clear that flag once IO on that
buffer is complete.  This allows do_get_write_access() to wait only
for BH_Shadow bit and thus removes the dependency on data IO
completion.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:08:56 -04:00
Jan Kara e5a120aeb5 jbd2: remove journal_head from descriptor buffers
Similarly as for metadata buffers, also log descriptor buffers don't
really need the journal head. So strip it and remove BJ_LogCtl list.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:06:01 -04:00
Jan Kara f5113effc2 jbd2: don't create journal_head for temporary journal buffers
When writing metadata to the journal, we create temporary buffer heads
for that task.  We also attach journal heads to these buffer heads but
the only purpose of the journal heads is to keep buffers linked in
transaction's BJ_IO list.  We remove the need for journal heads by
reusing buffer_head's b_assoc_buffers list for that purpose.  Also
since BJ_IO list is just a temporary list for transaction commit, we
use a private list in jbd2_journal_commit_transaction() for that thus
removing BJ_IO list from transaction completely.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:01:45 -04:00
Jan Kara 97a851ed71 ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.

Bugs in ENOMEM handling found by Linux File System Verification project
(linuxtesting.org) and fixed by Alexey Khoroshilov
<khoroshilov@ispras.ru>.

CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 11:58:58 -04:00
Linus Torvalds 0ab60871b4 A couple jfs bug fixes for 3.10-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.20 (GNU/Linux)
 
 iQIcBAABAgAGBQJRrLmSAAoJEDaohF61QIxkakoP/jo8fo2KaixVHRag1h7pBwVG
 7UJrgYmojoFUsmem+k34rXOzwc8eqv5BzTwGJe81ktatUjPkEQELdwryB4Rw6YYI
 F8/Mrc2AjHEavuTnojmENP22+YkPOW90r+k00dVk+cY9FQsc1/AhNYtzpVg9hSRX
 RBhGcgfnQQCAlIf09vr7R19oBLiHfrFNnUYJeUg/d0bxW8uiTimX7ENDTOtn61WF
 hLEABnL2tC0UtK3z/B0qNIpDE5DHHaLo8WcGjlXE3Pte4JA2rPDJwaHaUgvCEo6R
 iYCw5Dq6bL7Ce7wMP1GzY+LHgaeUg5Nknnkl5vEKZSZuy7dfC2eBlYbziqyJm73A
 0e7UgypkaNGa4p4MzM0J5OW+4bRRPzNir99BnpU8Bm6dJTY8ZeIMTTQFX2Fla8+G
 SCo6dMjEUfF+yIykDwmeZYjHl5XeKM6B3imoJ8+yD9vZLzg6MY4SUmgPEKL7BFoi
 U4aiID6N67LYAMb5vapzpcl+uHZbrw7CBnI2FanqdsAoPjath1CgrEvseBAUowpy
 EUeCLaQ05xR8ynm44tQCPQFgqBpRxAiVtpy9nPulA3cQCbwelEuLKrZB17DBfK9i
 w7f5e8nNG8gXfRuL5idGzS7g6Hp90JUhBWKh0dJC820OM7idWnXGCLbYTm7t6pvo
 m931/xUofSsXCLNwEo88
 =JZIK
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.10-rc5' of git://github.com/kleikamp/linux-shaggy

Pull jfs bugfixes from David Kleikamp:
 "A couple jfs bug fixes for 3.10-rc5"

* tag 'jfs-3.10-rc5' of git://github.com/kleikamp/linux-shaggy:
  fs/jfs: Add check if journaling to disk has been disabled in lbmRead()
  jfs: Several bugs in jfs_freeze() and jfs_unfreeze()
2013-06-04 06:33:44 +09:00
Stephen Rothwell 40b313608a Finally eradicate CONFIG_HOTPLUG
Ever since commit 45f035ab9b ("CONFIG_HOTPLUG should be always on"),
it has been basically impossible to build a kernel with CONFIG_HOTPLUG
turned off.  Remove all the remaining references to it.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-03 14:20:18 -07:00
Mathias Krause a3b2c8c7aa debugfs: write_file_bool() - ensure strtobool() operates on valid data
In case, userland writes an empty string to a bool debugfs file, buf[]
will still be uninitialized when being passed to strtobool() making the
outcome of that function purely random.

Fix this by always zero-terminating the buffer.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-03 13:55:02 -07:00
Seth Jennings 3a76e5e09f debugfs: add get/set for atomic types
debugfs currently lack the ability to create attributes
that set/get atomic_t values.

This patch adds support for this through a new
debugfs_create_atomic_t() function.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-03 13:55:01 -07:00
Bob Peterson a6a4d98b01 GFS2: Don't cache iopen glocks
This patch makes GFS2 immediately reclaim/delete all iopen glocks
as soon as they're dequeued. This allows deleters to get an
EXclusive lock on iopen so files are deleted properly instead of
being set as unlinked.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-03 16:40:22 +01:00
Bob Peterson e8830d8856 GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables
This version has one more correction: the vmalloc calls are replaced
by __vmalloc calls to preserve the GFP_NOFS flag.

When GFS2's directory management code allocates buffers for a
directory hash table, if it can't get the memory it needs, it
currently gives a bad return code. Rather than giving an error,
this patch allows it to use virtual memory rather than kernel
memory for the hash table. This should make it possible for
directories to function properly, even when kernel memory becomes
very fragmented.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-03 16:39:44 +01:00
Bob Peterson 2b3dcf3581 GFS2: Increase i_writecount during gfs2_setattr_size
This patch calls get_write_access in a few functions. This
merely increases inode->i_writecount for the duration of the function.
That will ensure that any file closes won't delete the inode's
multi-block reservation while the function is running.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-03 16:38:58 +01:00
Bob Peterson 4a58681205 GFS2: Set log descriptor type for jdata blocks
This patch sets the log descriptor type according to whether the
journal commit is for (journaled) data or metadata. This was
recently broken when the functions to process data and metadata
log ops were combined.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-03 16:38:39 +01:00
Maxim Patlasov e5c5f05dca fuse: fix alignment in short read optimization for async_dio
The bug was introduced with async_dio feature: trying to optimize short reads,
we cut number-of-bytes-to-read to i_size boundary. Hence the following example:

	truncate --size=300 /mnt/file
	dd if=/mnt/file of=/dev/null iflag=direct

led to FUSE_READ request of 300 bytes size. This turned out to be problem
for userspace fuse implementations who rely on assumption that kernel fuse
does not change alignment of request from client FS.

The patch turns off the optimization if async_dio is disabled. And, if it's
enabled, the patch fixes adjustment of number-of-bytes-to-read to preserve
alignment.

Note, that we cannot throw out short read optimization entirely because
otherwise a direct read of a huge size issued on a tiny file would generate
a huge amount of fuse requests and most of them would be ACKed by userspace
with zero bytes read.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-06-03 15:15:42 +02:00
Brian Foster c9ecf989cc fuse: return -EIOCBQUEUED from fuse_direct_IO() for all async requests
If request submission fails for an async request (i.e.,
get_user_pages() returns -ERESTARTSYS), we currently skip the
-EIOCBQUEUED return and drop into wait_for_sync_kiocb() forever.

Avoid this by always returning -EIOCBQUEUED for async requests. If
an error occurs, the error is passed into fuse_aio_complete(),
returned via aio_complete() and thus propagated to userspace via
io_getevents().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-06-03 15:15:42 +02:00
Miklos Szeredi 28420dad23 fuse: fix readdirplus Oops in fuse_dentry_revalidate
Fix bug introduced by commit 4582a4ab2a "FUSE: Adapt readdirplus to application
usage patterns".

We need to check for a positive dentry; negative dentries are not added by
readdirplus.  Secondly we need to advise the use of readdirplus on the *parent*,
otherwise the whole thing is useless.  Thirdly all this is only relevant if
"readdirplus_auto" mode is selected by the filesystem.

We advise the use of readdirplus only if the dentry was still valid.  If we had
to redo the lookup then there was no use in doing the -plus version.

Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Feng Shuo <steve.shuo.feng@gmail.com>
CC: stable@vger.kernel.org
2013-06-03 14:40:22 +02:00
Jason Hrycay 1e03e38b35 f2fs: handle errors from get_node_page calls
Add check for error pointers returned from get_node_page in order to
avoid dereferencing a bad address on the next use.

Signed-off-by: Jason Hrycay <jason.hrycay@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-03 19:49:09 +09:00
Linus Torvalds 6cf3c73620 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull assorted fixes from Al Viro:
 "There'll be more - I'm trying to dig out from under the pile of mail
  (a couple of weeks of something flu-like ;-/) and there's several more
  things waiting for review; this is just the obvious stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  zoran: racy refcount handling in vm_ops ->open()/->close()
  befs_readdir(): do not increment ->f_pos if filldir tells us to stop
  hpfs: deadlock and race in directory lseek()
  qnx6: qnx6_readdir() has a braino in pos calculation
  fix buffer leak after "scsi: saner replacements for ->proc_info()"
  vfs: Fix invalid ida_remove() call
2013-06-01 19:51:52 +09:00
Linus Torvalds f8cb27916a NFS client fixes:
- Fix a regression that broke NFS mounting using klibc and busybox
 - Stable fix to check access modes correctly on NFSv4 delegated open()
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRqUQdAAoJEGcL54qWCgDy4gMP/RfrcBIAJxEBUfB7Q/hCjoRG
 CTmfoq8XK/fTTmCBuTTjf0QntlKNJMFJSnJwGZqFzqA6PJOSDvkWFDf0WzzCtIEN
 3D8C6rko7oYUU2tcu8bJK3CGkfFNgh6ON519XMHMcOZCIpzycIO6DrZFvkcOS6aW
 lzlqHkB/ksH/G4Z7k2C9YyS6Ljr0odAxhsOcn27BWMtktCvsgTyPhowfo83aOzyd
 EFnwHwXSCbawyYIqNiBTFjoudaZwlFDhbStSWX5jcI/JwpR8rfhN8n1dU+AJknE6
 d6DdEgk1CVuBUke0VcDLf0dJ5iMbKY5bnoJF5PF09qTINPmErKUMNI13P3wt+Uyn
 RkmhoiH0ooiBhrkCJWlZHo/EJxl/ionQR7vOlEhN/Hvp11oreXmvHx/pL2WCCanJ
 wLEJnFGflay7NUfkKWSuUpyyBJTS7/5lrWaabBZ1qxlHSOonzwaKW3G0slzJroiV
 x6YmM7xEe6CEquEQmyFm5kJjlueYJpPzIvxFYsPun1cxpzjKosXgNCOEe8oe5Znj
 JkvNS+zc+VN7pw06EtmKgyqoavUSbVcAeZICegskCkUuhYQkdJaP5Lc819qeZj8D
 rlgyltkh6XP7s8W7a4UWePHqrjZyen62ULtNPLoEainSbdpu0BrItZqgnokQ3AIe
 mRiIKkpkoqHKVWWozhCh
 =HTIq
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull two NFS client fixes from Trond Myklebust:
 - Fix a regression that broke NFS mounting using klibc and busybox
 - Stable fix to check access modes correctly on NFSv4 delegated open()

* tag 'nfs-for-3.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix security flavor negotiation with legacy binary mounts
  NFSv4: Fix a thinko in nfs4_try_open_cached
2013-06-01 19:48:59 +09:00
Jan Kara 8af8eecc13 ext4: fix overflow when counting used blocks on 32-bit architectures
The arithmetics adding delalloc blocks to the number of used blocks in
ext4_getattr() can easily overflow on 32-bit archs as we first multiply
number of blocks by blocksize and then divide back by 512. Make the
arithmetics more clever and also use proper type (unsigned long long
instead of unsigned long).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:39:56 -04:00
Jan Kara a60697f411 ext4: fix data offset overflow in ext4_xattr_fiemap() on 32-bit archs
On 32-bit architectures with 32-bit sector_t computation of data offset
in ext4_xattr_fiemap() can overflow resulting in reporting bogus data
location. Fix the problem by typing block number to proper type before
shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:38:56 -04:00
Jan Kara e7293fd146 ext4: fix overflows in SEEK_HOLE, SEEK_DATA implementations
ext4_lblk_t is just u32 so multiplying it by blocksize can easily
overflow for files larger than 4 GB. Fix that by properly typing the
block offsets before shifting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-05-31 19:37:56 -04:00
Jan Kara eaf3793728 ext4: fix data offset overflow on 32-bit archs in ext4_inline_data_fiemap()
On 32-bit archs when sector_t is defined as 32-bit the logic computing
data offset in ext4_inline_data_fiemap(). Fix that by properly typing
the shifted value.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-31 19:33:42 -04:00
Linus Torvalds 1d822d6094 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs fixes from Jan Kara:
 "Three reiserfs fixes.  They fix real problems spotted by users so I
  hope they are ok even at this stage."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: fix deadlock with nfs racing on create/lookup
  reiserfs: fix problems with chowning setuid file w/ xattrs
  reiserfs: fix spurious multiple-fill in reiserfs_readdir_dentry
2013-06-01 06:59:14 +09:00
Linus Torvalds 7cfb953258 xfs: extended attribute fixes for CRCs
- Remove assert on count of remote attribute CRC headers
 - Fix the number of blocks read in for remote attributes
 - Zero remote attribute tails properly
 - Fix mapping of remote attribute buffers to have correct length
 - initialize temp leaf properly in xfs_attr3_leaf_unbalance, and
   xfs_attr3_leaf_compact
 - Rework remote atttributes to have a header per block
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRqQE3AAoJENaLyazVq6ZOBHMQAIaaBdvrHV0tg/EyzVCuVFul
 BJH2RrL7WqFe6cXjkb5qV9neesj/dyDoSOEVqko7klNxlUhnSK6j9UnM9QfeAZVg
 x22TQ+c938giCu/iibdaFsyEVIgX2WfjPY+OoKJh1kTPOimO1ekRE/H8qRJ3A5Ge
 hl5H/pyDcJXiXowNVQEF6sPj59BzEn65kelpIn/uW5BzrmdIjfM+wnvr4mvkxrCS
 0FM9acLmEl/ri448wsZjrlA8yltqg/J8LS+oc6AgWhWGB9qL3LzM7xtQblv47ail
 kJrvJek8RMQT1nQ8N1gQa0jCSCtWB+Lir04ynTM0b5h2NKcm6iFih5kd6a+wl5qd
 p26LkTSjIQLv21dx9fylTvUIf4xx0NCW/BNqSPpEgTXi6FUiYGzEq7iuvCfbGSYf
 +k9PB3KkVhmRzv/IWvP8OmnrhXZhQVo55Z+OulHwuaZ/+orL0Y0TtypW6imjETK5
 b/JiRoVaN1OroZUyg1qk4OnYftl6j93X3iyLniIZiI5hTMNZBhiqe4EcHd++pYDN
 gpaQC/V/9pAgSWhyyD6E5Jv7HvkUuOc73mSPHH3oz5WAoEI6I1XWx+pExo15BiZD
 FADL3HSnnvGDgGxfWb0nYSS/M0KdAT9lpu+48lKg1sIXVANnYkHxJNofM6ZXe042
 BUuItw8SCGARf9JEJF8J
 =VDbP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc4-crc-xattr-fixes' of git://oss.sgi.com/xfs/xfs

Pull xfs extended attribute fixes for CRCs from Ben Myers:
 "Here are several fixes that are relevant on CRC enabled XFS
  filesystems.  They are followed by a rework of the remote attribute
  code so that each block of the attribute contains a header with a CRC.

  Previously there was a CRC header per extent in the remote attribute
  code, but this was untenable because it was not possible to know how
  many extents would be allocated for the attribute until after the
  allocation has completed, due to the fragmentation of free space.
  This became complicated because the size of the headers needs to be
  added to the length of the payload to get the overall length required
  for the allocation.  With a header per block, things are less
  complicated at the cost of a little space.

  I would have preferred to defer this and the rest of the CRC queue to
  3.11 to mitigate risk for existing non-crc users in 3.10.  Doing so
  would require setting a feature bit for the on-disk changes, and so I
  have been pressured into sending this pull request by Eric Sandeen and
  David Chinner from Red Hat.  I'll send another pull request or two
  with the rest of the CRC queue next week.

   - Remove assert on count of remote attribute CRC headers
   - Fix the number of blocks read in for remote attributes
   - Zero remote attribute tails properly
   - Fix mapping of remote attribute buffers to have correct length
   - initialize temp leaf properly in xfs_attr3_leaf_unbalance, and
     xfs_attr3_leaf_compact
   - Rework remote atttributes to have a header per block"

* tag 'for-linus-v3.10-rc4-crc-xattr-fixes' of git://oss.sgi.com/xfs/xfs:
  xfs: rework remote attr CRCs
  xfs: fully initialise temp leaf in xfs_attr3_leaf_compact
  xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance
  xfs: correctly map remote attr buffers during removal
  xfs: remote attribute tail zeroing does too much
  xfs: remote attribute read too short
  xfs: remote attribute allocation may be contiguous
2013-06-01 06:56:21 +09:00
Linus Torvalds e8d256aca0 xfs: fixes for 3.10-rc4
- Fix nested transactions in xfs_qm_scall_setqlim
 - Clear suid/sgid bits when we truncate with size update
 - Fix recovery for split buffers
 - Fix block count on remote symlinks
 - Add fsgeom flag for v5 superblock support
 - Disable XFS_IOC_SWAPEXT for CRC enabled filesystems
 - Fix dirv3 freespace block corruption
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRqPJxAAoJENaLyazVq6ZOxVIQAIPG+G+PKTQsbj9kNAxk2yGb
 jAD9HD9bbFFD1Xz3RoO/CoJpXcxo/Gj/6r94ZNTlmvR1SHyL4P/9gCBI8F9Su+Ru
 wAtSyWyubV8bERYLfQ5zys7sNWBGhwYkHyjslIkuoRMv6jlAFpZugghKbF7hNCKg
 qNCkMlaj0uQQAWLAjiWMUwMYjQEhzmvds1VCf4pSCycI3/opYIi9ZV6Wr5vHcJkw
 nicZa5M9OH1oiIADKiMBXMdNGjmLx3l3gM7eqOoJeJWlok6MSp0AciB2hYwDuz/8
 5cxkId5dHO0IV1Bf0N4vYDGxnDBRbbncVzywFYS0WS/4MjL8IGR7jUzGZV+bH52v
 jax5TbS6EkbmgVr2hErmjBvUoAnVRZ1rbAYDNpvudhNjAz0730y0Zy17J2ahja6Q
 0wunmpyd/3hfaeczVu1DkXDWIYv6c0yGxz2Ca9o+bk9PIS7T0NjVWc2niATtnPWh
 HaQU0Ix/2y2zHWnSFJ28iikG1Dm5iTyByCUK1sToHLilB7o6GA8HVVzf5T02usN4
 OxLWqKunwUQzlTJ9Ng6YUGlI/iZiep/VgSY59QpivF6KwMN+S7XE1lPnGYRvHNVy
 gJCjyjbT9qPsJ1gy5pun9IuzbusYPpOzU/OUQtDNx35xyrRfV4Jy1fijSCHykHM+
 AWVVy0u03OueOE3/LQGv
 =r+vv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc4' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Ben Myers:
 - Fix nested transactions in xfs_qm_scall_setqlim
 - Clear suid/sgid bits when we truncate with size update
 - Fix recovery for split buffers
 - Fix block count on remote symlinks
 - Add fsgeom flag for v5 superblock support
 - Disable XFS_IOC_SWAPEXT for CRC enabled filesystems
 - Fix dirv3 freespace block corruption

* tag 'for-linus-v3.10-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: fix dir3 freespace block corruption
  xfs: disable swap extents ioctl on CRC enabled filesystems
  xfs: add fsgeom flag for v5 superblock support.
  xfs: fix incorrect remote symlink block count
  xfs: fix split buffer vector log recovery support
  xfs: kill suid/sgid through the truncate path.
  xfs: avoid nesting transactions in xfs_qm_scall_setqlim()
2013-06-01 06:50:16 +09:00
Jeff Layton 1fc29baced cifs: fix off-by-one bug in build_unc_path_to_root
commit 839db3d10a (cifs: fix up handling of prefixpath= option) changed
the code such that the vol->prepath no longer contained a leading
delimiter and then fixed up the places that accessed that field to
account for that change.

One spot in build_unc_path_to_root was missed however. When doing the
pointer addition on pos, that patch failed to account for the fact that
we had already incremented "pos" by one when adding the length of the
prepath. This caused a buffer overrun by one byte.

This patch fixes the problem by correcting the handling of "pos".

Cc: <stable@vger.kernel.org> # v3.8+
Reported-by: Marcus Moeller <marcus.moeller@gmx.ch>
Reported-by: Ken Fallon <ken.fallon@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-31 16:23:35 -05:00
Jeff Mahoney a1457c0ce9 reiserfs: fix deadlock with nfs racing on create/lookup
Reiserfs is currently able to be deadlocked by having two NFS clients
where one has removed and recreated a file and another is accessing the
file with an open file handle.

If one client deletes and recreates a file with timing such that the
recreated file obtains the same [dirid, objectid] pair as the original
file while another client accesses the file via file handle, the create
and lookup can race and deadlock if the lookup manages to create the
in-memory inode first.

The create thread, in insert_inode_locked4, will hold the write lock
while waiting on the other inode to be unlocked. The lookup thread,
anywhere in the iget path, will release and reacquire the write lock while
it schedules. If it needs to reacquire the lock while the create thread
has it, it will never be able to make forward progress because it needs
to reacquire the lock before ultimately unlocking the inode.

This patch drops the write lock across the insert_inode_locked4 call so
that the ordering of inode_wait -> write lock is retained. Since this
would have been the case before the BKL push-down, this is safe.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-05-31 23:14:20 +02:00
Jeff Mahoney 4a8570112b reiserfs: fix problems with chowning setuid file w/ xattrs
reiserfs_chown_xattrs() takes the iattr struct passed into ->setattr
and uses it to iterate over all the attrs associated with a file to change
ownership of xattrs (and transfer quota associated with the xattr files).

When the setuid bit is cleared during chown, ATTR_MODE and iattr->ia_mode
are passed to all the xattrs as well. This means that the xattr directory
will have S_IFREG added to its mode bits.

This has been prevented in practice by a missing IS_PRIVATE check
in reiserfs_acl_chmod, which caused a double-lock to occur while holding
the write lock. Since the file system was completely locked up, the
writeout of the corrupted mode never happened.

This patch temporarily clears everything but ATTR_UID|ATTR_GID for the
calls to reiserfs_setattr and adds the missing IS_PRIVATE check.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-05-31 23:14:11 +02:00
Jeff Mahoney 0bdc7acba5 reiserfs: fix spurious multiple-fill in reiserfs_readdir_dentry
After sleeping for filldir(), we check to see if the file system has
changed and research. The next_pos pointer is updated but its value
isn't pushed into the key used for the search itself. As a result,
the search returns the same item that the last cycle of the loop did
and filldir() is called multiple times with the same data.

The end result is that the buffer can contain the same name multiple
times. This can be returned to userspace or used internally in the
xattr code where it can manifest with the following warning:

jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2)

reiserfs_for_each_xattr uses reiserfs_readdir_dentry to iterate over
the xattr names and ends up trying to unlink the same name twice. The
second attempt fails with -ENOENT and the error is returned. At some
point I'll need to add support into reiserfsck to remove the orphaned
directories left behind when this occurs.

The fix is to push the value into the key before researching.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-05-31 23:13:58 +02:00
Al Viro 448293aadb befs_readdir(): do not increment ->f_pos if filldir tells us to stop
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-31 15:17:56 -04:00
Al Viro 31abdab9c1 hpfs: deadlock and race in directory lseek()
For one thing, there's an ABBA deadlock on hpfs fs-wide lock and i_mutex
in hpfs_dir_lseek() - there's a lot of methods that grab the former with
the caller already holding the latter, so it must take i_mutex first.

For another, locking the damn thing, carefully validating the offset,
then dropping locks and assigning the offset is obviously racy.

Moreover, we _must_ do hpfs_add_pos(), or the machinery in dnode.c
won't modify the sucker on B-tree surgeries.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-31 15:17:43 -04:00
Al Viro 1d7095c72d qnx6: qnx6_readdir() has a braino in pos calculation
We want to mask lower 5 bits out, not leave only those and clear the
rest...  As it is, we end up always starting to read from the beginning
of directory, no matter what the current position had been.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-31 15:17:31 -04:00
Takashi Iwai 5d477b6079 vfs: Fix invalid ida_remove() call
When the group id of a shared mount is not allocated, the umount still
tries to call mnt_release_group_id(), which eventually hits a kernel
warning at ida_remove() spewing a message like:
  ida_remove called for id=0 which is not allocated.

This patch fixes the bug simply checking the group id in the caller.

Reported-by: Cristian Rodríguez <crrodriguez@opensuse.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-31 15:16:33 -04:00
Linus Torvalds 484b002e28 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:

 - Three EFI-related fixes

 - Two early memory initialization fixes

 - build fix for older binutils

 - fix for an eager FPU performance regression -- currently we don't
   allow the use of the FPU at interrupt time *at all* in eager mode,
   which is clearly wrong.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Allow FPU to be used at interrupt time even with eagerfpu
  x86, crc32-pclmul: Fix build with older binutils
  x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
  x86, range: fix missing merge during add range
  x86, efi: initial the local variable of DataSize to zero
  efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
  efivarfs: Never return ENOENT from firmware again
2013-05-31 09:44:10 +09:00
Dave Chinner 7bc0dc271e xfs: rework remote attr CRCs
Note: this changes the on-disk remote attribute format. I assert
that this is OK to do as CRCs are marked experimental and the first
kernel it is included in has not yet reached release yet. Further,
the userspace utilities are still evolving and so anyone using this
stuff right now is a developer or tester using volatile filesystems
for testing this feature. Hence changing the format right now to
save longer term pain is the right thing to do.

The fundamental change is to move from a header per extent in the
attribute to a header per filesytem block in the attribute. This
means there are more header blocks and the parsing of the attribute
data is slightly more complex, but it has the advantage that we
always know the size of the attribute on disk based on the length of
the data it contains.

This is where the header-per-extent method has problems. We don't
know the size of the attribute on disk without first knowing how
many extents are used to hold it. And we can't tell from a
mapping lookup, either, because remote attributes can be allocated
contiguously with other attribute blocks and so there is no obvious
way of determining the actual size of the atribute on disk short of
walking and mapping buffers.

The problem with this approach is that if we map a buffer
incorrectly (e.g. we make the last buffer for the attribute data too
long), we then get buffer cache lookup failure when we map it
correctly. i.e. we get a size mismatch on lookup. This is not
necessarily fatal, but it's a cache coherency problem that can lead
to returning the wrong data to userspace or writing the wrong data
to disk. And debug kernels will assert fail if this occurs.

I found lots of niggly little problems trying to fix this issue on a
4k block size filesystem, finally getting it to pass with lots of
fixes. The thing is, 1024 byte filesystems still failed, and it was
getting really complex handling all the corner cases that were
showing up. And there were clearly more that I hadn't found yet.

It is complex, fragile code, and if we don't fix it now, it will be
complex, fragile code forever more.

Hence the simple fix is to add a header to each filesystem block.
This gives us the same relationship between the attribute data
length and the number of blocks on disk as we have without CRCs -
it's a linear mapping and doesn't require us to guess anything. It
is simple to implement, too - the remote block count calculated at
lookup time can be used by the remote attribute set/get/remove code
without modification for both CRC and non-CRC filesystems. The world
becomes sane again.

Because the copy-in and copy-out now need to iterate over each
filesystem block, I moved them into helper functions so we separate
the block mapping and buffer manupulations from the attribute data
and CRC header manipulations. The code becomes much clearer as a
result, and it is a lot easier to understand and debug. It also
appears to be much more robust - once it worked on 4k block size
filesystems, it has worked without failure on 1k block size
filesystems, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit ad1858d777)
2013-05-30 17:26:31 -05:00
Dave Chinner 634fd5322a xfs: fully initialise temp leaf in xfs_attr3_leaf_compact
xfs_attr3_leaf_compact() uses a temporary buffer for compacting the
the entries in a leaf. It copies the the original buffer into the
temporary buffer, then zeros the original buffer completely. It then
copies the entries back into the original buffer.  However, the
original buffer has not been correctly initialised, and so the
movement of the entries goes horribly wrong.

Make sure the zeroed destination buffer is fully initialised, and
once we've set up the destination incore header appropriately, write
is back to the buffer before starting to move entries around.

While debugging this, the _d/_s prefixes weren't sufficient to
remind me what buffer was what, so rename then all _src/_dst.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit d4c712bcf2)
2013-05-30 17:26:24 -05:00
Dave Chinner 9e80c76205 xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance
xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining
the entries in two leaves when the destination leaf requires
compaction. The temporary buffer ends up being copied back over the
original destination buffer, so the header in the temporary buffer
needs to contain all the information that is in the destination
buffer.

To make sure the temporary buffer is fully initialised, once we've
set up the temporary incore header appropriately, write is back to
the temporary buffer before starting to move entries around.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 8517de2a81)
2013-05-30 17:26:16 -05:00
Dave Chinner 58a7228155 xfs: correctly map remote attr buffers during removal
If we don't map the buffers correctly (same as for get/set
operations) then the incore buffer lookup will fail. If a block
number matches but a length is wrong, then debug kernels will ASSERT
fail in _xfs_buf_find() due to the length mismatch. Ensure that we
map the buffers correctly by basing the length of the buffer on the
attribute data length rather than the remote block count.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 6863ef8449)
2013-05-30 17:26:08 -05:00
Dave Chinner 26f714450c xfs: remote attribute tail zeroing does too much
When an attribute data does not fill then entire remote block, we
zero the remaining part of the buffer. This, however, needs to take
into account that the buffer has a header, and so the offset where
zeroing starts and the length of zeroing need to take this into
account. Otherwise we end up with zeros over the end of the
attribute value when CRCs are enabled.

While there, make sure we only ask to map an extent that covers the
remaining range of the attribute, rather than asking every time for
the full length of remote data. If the remote attribute blocks are
contiguous with other parts of the attribute tree, it will map those
blocks as well and we can potentially zero them incorrectly. We can
also get buffer size mistmatches when trying to read or remove the
remote attribute, and this can lead to not finding the correct
buffer when looking it up in cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 4af3644c9a)
2013-05-30 17:25:58 -05:00
Dave Chinner 551b382f53 xfs: remote attribute read too short
Reading a maximally size remote attribute fails when CRCs are
enabled with this verification error:

XFS (vdb): remote attribute header does not match required off/len/owner)

There are two reasons for this, the first being that the
length of the buffer being read is determined from the
args->rmtblkcnt which doesn't take into account CRC headers. Hence
the mapped length ends up being too short and so we need to
calculate it directly from the value length.

The second is that the byte count of valid data within a buffer is
capped by the length of the data and so doesn't take into account
that the buffer might be longer due to headers. Hence we need to
calculate the data space in the buffer first before calculating the
actual byte count of data.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 913e96bc29)
2013-05-30 17:25:50 -05:00
Dave Chinner 9531e2de6b xfs: remote attribute allocation may be contiguous
When CRCs are enabled, there may be multiple allocations made if the
headers cause a length overflow. This, however, does not mean that
the number of headers required increases, as the second and
subsequent extents may be contiguous with the previous extent. Hence
when we map the extents to write the attribute data, we may end up
with less extents than allocations made. Hence the assertion that we
consume the number of headers we calculated in the allocation loop
is incorrect and needs to be removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 90253cf142)
2013-05-30 17:25:39 -05:00
Dave Chinner e400d27d16 xfs: fix dir3 freespace block corruption
When the directory freespace index grows to a second block (2017
4k data blocks in the directory), the initialisation of the second
new block header goes wrong. The write verifier fires a corruption
error indicating that the block number in the header is zero. This
was being tripped by xfs/110.

The problem is that the initialisation of the new block is done just
fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2
structure to zero on-disk header fields that xfs_dir3_free_get_buf()
has already zeroed. These lined up with the block number in the dir
v3 header format.

While looking at this, I noticed that the struct xfs_dir3_free_hdr()
had 4 bytes of padding in it that wasn't defined as padding or being
zeroed by the initialisation. Add a pad field declaration and fully
zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
that this is never an issue in the future. Note that this doesn't
change the on-disk layout, just makes the 32 bits of padding in the
layout explicit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 5ae6e6a401)
2013-05-30 17:22:54 -05:00
Dave Chinner 7c9950fd2a xfs: disable swap extents ioctl on CRC enabled filesystems
Currently, swapping extents from one inode to another is a simple
act of switching data and attribute forks from one inode to another.
This, unfortunately in no longer so simple with CRC enabled
filesystems as there is owner information embedded into the BMBT
blocks that are swapped between inodes. Hence swapping the forks
between inodes results in the inodes having mapping blocks that
point to the wrong owner and hence are considered corrupt.

To fix this we need an extent tree block or record based swap
algorithm so that the BMBT block owner information can be updated
atomically in the swap transaction. This is a significant piece of
new work, so for the moment simply don't allow swap extent
operations to succeed on CRC enabled filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 02f75405a7)
2013-05-30 17:20:08 -05:00
Dave Chinner e7927e879d xfs: add fsgeom flag for v5 superblock support.
Currently userspace has no way of determining that a filesystem is
CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
indicate that the filesystem has v5 superblock support enabled.
This will allow xfs_info to correctly report the state of the
filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 74137fff06)
2013-05-30 17:19:45 -05:00
Dave Chinner 1de09d1ae4 xfs: fix incorrect remote symlink block count
When CRCs are enabled, the number of blocks needed to hold a remote
symlink on a 1k block size filesystem may be 2 instead of 1. The
transaction reservation for the allocated blocks was not taking this
into account and only allocating one block. Hence when trying to
read or invalidate such symlinks, we are mapping a hole where there
should be a block and things go bad at that point.

Fix the reservation to use the correct block count, clean up the
block count calculation similar to the remote attribute calculation,
and add a debug guard to detect when we don't write the entire
symlink to disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 321a95839e)
2013-05-30 17:19:07 -05:00
Dave Chinner 7d2ffe80aa xfs: fix split buffer vector log recovery support
A long time ago in a galaxy far away....

.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603

That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.

And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.

xfstest xfs/085 is failing with this assert:

XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583

Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:

struct foo_log_format {
	__uint16_t	type;
	__uint16_t	size;
	....

As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items.  So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.

But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:

Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2                                                 *
**********************************************************************

That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.

That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:

(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
       blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
       blf_data_map = {0xffffffff, 0xffffffff, .... }}

Two regions, and a signle dirty contiguous region of 64 bits.  64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.

But we know that the next region has a header of:

(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
       oh_flags = 0x0, oh_res2 = 0x0}

and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.

So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.

First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:

                        vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                        vecp->i_len = nbits * XFS_BLF_CHUNK;
                        vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
 * You would think we need to bump the nvecs here too, but we do not
 * this number is used by recovery, and it gets confused by the boundary
 * split here
 *                      nvecs++;
 */
                        vecp++;

And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.

What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.

The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.

Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.

So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....

So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....

.... and there's the problemin log recovery it is apparently working
around:

XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135

Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.

Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 709da6a61a)
2013-05-30 17:18:01 -05:00
Dave Chinner 2962f5a5dc xfs: kill suid/sgid through the truncate path.
XFS has failed to kill suid/sgid bits correctly when truncating
files of non-zero size since commit c4ed4243 ("xfs: split
xfs_setattr") introduced in the 3.1 kernel. Fix it.

Fix it.

cc: stable kernel <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 56c19e89b3)
2013-05-30 17:17:35 -05:00
Dave Chinner 08fb39051f xfs: avoid nesting transactions in xfs_qm_scall_setqlim()
Lockdep reports:

=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #3 Not tainted
---------------------------------------------
setquota/28368 is trying to acquire lock:
 (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50

but task is already holding lock:
 (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50

from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
allocated.

xfs_qm_scall_setqlim() is starting a transaction and then not
passing it into xfs_qm_dqet() and so it starts it's own transaction
when allocating the dquot.  Splat!

Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
inside the setqlim transaction. This requires getting the dquot
first (and allocating it if necessary) then dropping and relocking
the dquot before joining it to the setqlim transaction.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f648167f3a)
2013-05-30 17:10:56 -05:00
Chuck Lever eb54d43707 NFS: Fix security flavor negotiation with legacy binary mounts
Darrick J. Wong <darrick.wong@oracle.com> reports:
> I have a kvm-based testing setup that netboots VMs over NFS, the
> client end of which seems to have broken somehow in 3.10-rc1.  The
> server's exports file looks like this:
>
> /storage/mtr/x64	192.168.122.0/24(ro,sync,no_root_squash,no_subtree_check)
>
> On the client end (inside the VM), the initrd runs the following
> command to try to mount the rootfs over NFS:
>
> # mount -o nolock -o ro -o retrans=10 192.168.122.1:/storage/mtr/x64/ /root
>
> (Note: This is the busybox mount command.)
>
> The mount fails with -EINVAL.

Commit 4580a92d44 "NFS: Use server-recommended security flavor by
default (NFSv3)" introduced a behavior regression for NFS mounts
done via a legacy binary mount(2) call.

Ensure that a default security flavor is specified for legacy binary
mount requests, since they do not invoke nfs_select_flavor() in the
kernel.

Busybox uses klibc's nfsmount command, which performs NFS mounts
using the legacy binary mount data format.  /sbin/mount.nfs is not
affected by this regression.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-30 16:31:34 -04:00
Dave Chinner 5ae6e6a401 xfs: fix dir3 freespace block corruption
When the directory freespace index grows to a second block (2017
4k data blocks in the directory), the initialisation of the second
new block header goes wrong. The write verifier fires a corruption
error indicating that the block number in the header is zero. This
was being tripped by xfs/110.

The problem is that the initialisation of the new block is done just
fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2
structure to zero on-disk header fields that xfs_dir3_free_get_buf()
has already zeroed. These lined up with the block number in the dir
v3 header format.

While looking at this, I noticed that the struct xfs_dir3_free_hdr()
had 4 bytes of padding in it that wasn't defined as padding or being
zeroed by the initialisation. Add a pad field declaration and fully
zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
that this is never an issue in the future. Note that this doesn't
change the on-disk layout, just makes the 32 bits of padding in the
layout explicit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 14:32:47 -05:00
Dave Chinner 56c19e89b3 xfs: kill suid/sgid through the truncate path.
XFS has failed to kill suid/sgid bits correctly when truncating
files of non-zero size since commit c4ed4243 ("xfs: split
xfs_setattr") introduced in the 3.1 kernel. Fix it.

Fix it.

cc: stable kernel <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 13:43:52 -05:00
Dave Chinner 74137fff06 xfs: add fsgeom flag for v5 superblock support.
Currently userspace has no way of determining that a filesystem is
CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
indicate that the filesystem has v5 superblock support enabled.
This will allow xfs_info to correctly report the state of the
filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 12:57:25 -05:00
Dave Chinner 02f75405a7 xfs: disable swap extents ioctl on CRC enabled filesystems
Currently, swapping extents from one inode to another is a simple
act of switching data and attribute forks from one inode to another.
This, unfortunately in no longer so simple with CRC enabled
filesystems as there is owner information embedded into the BMBT
blocks that are swapped between inodes. Hence swapping the forks
between inodes results in the inodes having mapping blocks that
point to the wrong owner and hence are considered corrupt.

To fix this we need an extent tree block or record based swap
algorithm so that the BMBT block owner information can be updated
atomically in the swap transaction. This is a significant piece of
new work, so for the moment simply don't allow swap extent
operations to succeed on CRC enabled filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 12:55:31 -05:00
Dave Chinner 709da6a61a xfs: fix split buffer vector log recovery support
A long time ago in a galaxy far away....

.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603

That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.

And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.

xfstest xfs/085 is failing with this assert:

XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583

Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:

struct foo_log_format {
	__uint16_t	type;
	__uint16_t	size;
	....

As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items.  So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.

But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:

Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2                                                 *
**********************************************************************

That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.

That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:

(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
       blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
       blf_data_map = {0xffffffff, 0xffffffff, .... }}

Two regions, and a signle dirty contiguous region of 64 bits.  64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.

But we know that the next region has a header of:

(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
       oh_flags = 0x0, oh_res2 = 0x0}

and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.

So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.

First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:

                        vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                        vecp->i_len = nbits * XFS_BLF_CHUNK;
                        vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
 * You would think we need to bump the nvecs here too, but we do not
 * this number is used by recovery, and it gets confused by the boundary
 * split here
 *                      nvecs++;
 */
                        vecp++;

And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.

What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.

The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.

Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.

So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....

So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....

.... and there's the problemin log recovery it is apparently working
around:

XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135

Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.

Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 12:48:33 -05:00
Dave Chinner 321a95839e xfs: fix incorrect remote symlink block count
When CRCs are enabled, the number of blocks needed to hold a remote
symlink on a 1k block size filesystem may be 2 instead of 1. The
transaction reservation for the allocated blocks was not taking this
into account and only allocating one block. Hence when trying to
read or invalidate such symlinks, we are mapping a hole where there
should be a block and things go bad at that point.

Fix the reservation to use the correct block count, clean up the
block count calculation similar to the remote attribute calculation,
and add a debug guard to detect when we don't write the entire
symlink to disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 12:37:04 -05:00
Dave Chinner 34510185ab xfs: don't emit v5 superblock warnings on write
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:

XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!

And spamming the logs.

We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-30 12:24:19 -05:00
Trond Myklebust f448badd34 NFSv4: Fix a thinko in nfs4_try_open_cached
We need to pass the full open mode flags to nfs_may_open() when doing
a delegated open.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-05-29 16:03:23 -04:00
Todd Poynor 11ffa9d606 timerfd: Add alarm timers
Add support for clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM,
thereby enabling wakeup alarm timers via file descriptors.

Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-05-29 12:57:34 -07:00
Linus Torvalds 320b34e3e0 Merge branch 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Fixes for a couple of DFS problems, a problem with extended security
  negotiation and two other small cifs fixes"

* 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix composing of mount options for DFS referrals
  cifs: stop printing the unc= option in /proc/mounts
  cifs: fix error handling when calling cifs_parse_devname
  cifs: allow sec=none mounts to work against servers that don't support extended security
  cifs: fix potential buffer overrun when composing a new options string
  cifs: only set ops for inodes in I_NEW state
2013-05-28 10:08:39 -07:00
Paul Taysom 566370a2e5 ext4: suppress ext4 orphan messages on mount
Suppress the messages releating to processing the ext4 orphan list
("truncating inode" and "deleting unreferenced inode") unless the
debug option is on, since otherwise they end up taking up space in the
log that could be used for more useful information.

Tested by opening several files, unlinking them, then
crashing the system, rebooting the system and examining
/var/log/messages.

Addresses the problem described in http://crbug.com/220976

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-05-28 07:51:21 -04:00
Darrick J. Wong eee06c5678 jbd2: fix block tag checksum verification brokenness
Al Viro complained of a ton of bogosity with regards to the jbd2 block
tag header checksum.  This one checksum is 16 bits, so cut off the
upper 16 bits and treat it as a 16-bit value and don't mess around
with be32* conversions.  Fortunately metadata checksumming is still
"experimental" and not in a shipping e2fsprogs, so there should be few
users affected by this.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-05-28 07:31:59 -04:00
Zheng Liu 5d9cf9c625 jbd2: use kmem_cache_zalloc for allocating journal head
This commit tries to use kmem_cache_zalloc instead of kmem_cache_alloc/
memset when a new journal head is alloctated.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2013-05-28 07:27:11 -04:00
Masanari Iida 8b513d0cf6 treewide: Fix typo in printk
Correct spelling typo in various part of drivers

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28 12:02:13 +02:00
Stefan Behrens 7e21f14d17 btrfs: fix btrfs_extend_item() comment
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28 12:02:12 +02:00
Pavel Tikhomirov 15ef0298de posix-timers: Show clock ID in proc file
Expand information about posix-timers in /proc/<pid>/timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: <clock_id>".

Signed-off-by: Pavel Tikhomirov <snorcht@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snorcht@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-05-28 11:41:14 +02:00
Jaegeuk Kim 83d5d6f66b f2fs: cover cp_file information with ilock
If a file is linked with other files, it should be checkpointed at every fsync
calls.
For this, we use set_cp_file() with FADVISE_CP_BIT, but previously we didn't
cover the flag by the global lock.
This patch fixes that the inode page stores this correctly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:06 +09:00
Jaegeuk Kim afc3eda2a8 f2fs: fix incorrect iputs during the dentry recovery
- iget/iput flow in the dentry recovery process

1. *dir* = f2fs_iget
2. set FI_DELAY_IPUT to *dir*
3. add *dir* to the dirty_dir_list
		   - __f2fs_add_link
		     - recover_dentry)
4. iput *dir* by remove_dirty_dir_inode
		   - sync_dirty_dir_inodes
		     - write_chekcpoint

If *dir*'s i_count is not 1 (i.e., root dir), remove_dirty_dir_inode is called
later and then iput is triggered again due to the FI_DELAY_IPUT flag.
So, let's unset the flag properly once iput is triggered.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:06 +09:00
Jaegeuk Kim 6b8213d9a4 f2fs: fix dentry recovery routine
The error scenario is:
1. create /a
(1.a link /a /b)
2. sync
3. unlinke /a
4. create /a
5. fsync /a
6. Sudden power-off

When the f2fs recovers the fsynced dentry, /a, we discover an exsiting dentry at
f2fs_find_entry() in recover_dentry().

In such the case, we should unlink the existing dentry and its inode
and then recover newly created dentry.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Jaegeuk Kim 3b10b1fd2b f2fs: iput only if whole data blocks are flushed
If there remains some unwritten blocks from the recovery, we should not call
iput on that directory inode.
Otherwise, we can loose some dentry blocks after the recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Namjae Jeon 7a267f8d74 f2fs: return proper error from start_gc_thread
when there is an error from kthread_run, then return proper error
rather than returning -ENOMEM.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Namjae Jeon a06a241603 f2fs: optimize several routines in node.h
There are various functions with common code which could be separated
out to make common routines. So, made new routines and in order to
retain the same call path and no major changes, written some macros
to access those routines.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Namjae Jeon 4777f86b7c f2fs: remove unneeded initializations in f2fs_parent_dir
There is no need to initialize few pointers in f2fs_parent_dir
as the values are not checked and instead directly initialized
values are used.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Namjae Jeon 35b09d82c3 f2fs: push some variables to debug part
Some, counters are needed only for the statistical information
while debugging.
So, those can be controlled using CONFIG_F2FS_STAT_FS,
pushing the usage for few variables under this flag.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:05 +09:00
Jaegeuk Kim a9841c4dbb f2fs: align data types between on-disk and in-memory block addresses
The on-disk block address is defined as __le32, but in-memory block address,
block_t, does as u64.

Let's synchronize them to 32 bits.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Dan Carpenter f28c06fa6f f2fs: dereferencing an ERR_PTR
There is an error path where "dir" is an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Jaegeuk Kim 6f6fd833e1 f2fs: use ihold
Use the following helper function committed by Al.

commit 7de9c6ee3e
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sat Oct 23 11:11:40 2010 -0400

    new helper: ihold()

...

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Jaegeuk Kim 93ff10d690 f2fs: should not make_bad_inode on f2fs_link failure
If -ENOSPC is met during f2fs_link, we should not make the inode as bad.
The inode is still alive.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Jaegeuk Kim 39cf72cf09 f2fs: fix to handle do_recover_data errors
This patch adds error handling codes of check_index_in_prev_nodes and its
caller, do_recover_data.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Jaegeuk Kim b292dcab06 f2fs: reuse the locked dnode page and its inode
This patch fixes the following deadlock bug during the recovery.

INFO: task mount:1322 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount           D ffffffff81125870     0  1322   1266 0x00000000
 ffff8801207e39d8 0000000000000046 ffff88012ab1dee0 0000000000000046
 ffff8801207e3a08 ffff880115903f40 ffff8801207e3fd8 ffff8801207e3fd8
 ffff8801207e3fd8 ffff880115903f40 ffff8801207e39d8 ffff88012fc94520
Call Trace:
[<ffffffff81125870>] ? __lock_page+0x70/0x70
[<ffffffff816a92d9>] schedule+0x29/0x70
[<ffffffff816a93af>] io_schedule+0x8f/0xd0
[<ffffffff8112587e>] sleep_on_page+0xe/0x20
[<ffffffff816a649a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81125867>] __lock_page+0x67/0x70
[<ffffffff8106c7b0>] ? autoremove_wake_function+0x40/0x40
[<ffffffff81126857>] find_lock_page+0x67/0x80
[<ffffffff8112698f>] find_or_create_page+0x3f/0xb0
[<ffffffffa03901a8>] ? sync_inode_page+0xa8/0xd0 [f2fs]
[<ffffffffa038fdf7>] get_node_page+0x67/0x180 [f2fs]
[<ffffffffa039818b>] recover_fsync_data+0xacb/0xff0 [f2fs]
[<ffffffff816aaa1e>] ? _raw_spin_unlock+0x3e/0x40
[<ffffffffa0389634>] f2fs_fill_super+0x7d4/0x850 [f2fs]
[<ffffffff81184cf9>] mount_bdev+0x1c9/0x210
[<ffffffffa0388e60>] ? validate_superblock+0x180/0x180 [f2fs]
[<ffffffffa0387635>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffff81185a13>] mount_fs+0x43/0x1b0
[<ffffffff81145ba0>] ? __alloc_percpu+0x10/0x20
[<ffffffff811a0796>] vfs_kern_mount+0x76/0x120
[<ffffffff811a2cb7>] do_mount+0x237/0xa10
[<ffffffff81140b9b>] ? strndup_user+0x5b/0x80
[<ffffffff811a3520>] SyS_mount+0x90/0xe0
[<ffffffff816b3502>] system_call_fastpath+0x16/0x1b

The bug is triggered when check_index_in_prev_nodes tries to get the direct
node page by calling get_node_page.
At this point, if the direct node page is already locked by get_dnode_of_data,
its caller, we got a deadlock condition.

This patch adds additional condition check for the reuse of locked direct node
pages prior to the get_node_page call.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:04 +09:00
Jaegeuk Kim b638f0c4b8 f2fs: fix wrong condition check
While an orphan inode has zero link_count, f2fs_gc is able to select the inode
for foreground gc.

- f2fs_gc
 - do_garbage_collect
   - gc_data_segment
     : f2fs_iget is failed
     : get_valid_blocks() != 0, so that retry
--> here we got the infinite loop.

This patch resolved this issue.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Jaegeuk Kim 77888c1e42 f2fs: add f2fs_readonly()
Introduce a simple macro function for readability.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Jaegeuk Kim 6f85b35203 f2fs: avoid RECLAIM_FS-ON-W: deadlock
This patch tries to avoid the following deadlock condition of which the reclaim
path can trigger f2fs_balance_fs again.

=================================
[ INFO: inconsistent lock state ]
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs]
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140
  [<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0
  [<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0
  [<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180
  [<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0
  [<ffffffff8113225c>] find_or_create_page+0x4c/0xb0
  [<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs]
  [<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs]
  [<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs]
  [<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs]
  [<ffffffff811ae51b>] notify_change+0x1db/0x3a0
  [<ffffffff8118fbd0>] do_truncate+0x60/0xa0
  [<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0
  [<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0
  [<ffffffff8118ffee>] SyS_truncate+0xe/0x10
  [<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Jaegeuk Kim 2c2c149f7d f2fs: don't do checkpoint if error is occurred
If we met an error during the dentry recovery, we should not conduct checkpoint.
Otherwise, some errorneous dentry blocks overwrites the existing blocks that
contain the remaining recovery information.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Jaegeuk Kim 45856aff0d f2fs: fix to unlock page before exit
If we got an error after lock_page, we should unlock it before exit.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Jaegeuk Kim 9a55ed656c f2fs: remove unnecessary kmap/kunmap operations
The allocated page used by the recovery is not on HIGHMEM, so that we don't
need to use kmap/kunmap.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:03 +09:00
Namjae Jeon 9851e6e189 f2fs: reorganize f2fs_vm_page_mkwrite
Few things can be changed in the default mkwrite function
1) Make file_update_time at the start before acquiring any lock
2) the condition page_offset(page) >= i_size_read(inode) should be
 changed to page_offset(page) > i_size_read
3) Move wait_on_page_writeback.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
majianpeng 145b04e5ed f2fs: use list_for_each_entry rather than list_for_each_entry_safe
We can do this, since now we use a global mutex, f2fs_stat_mutex to protect its
list operations.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
Haicheng Li 81fb5e8746 f2fs: remove unecessary variable and code
Code cleanup without behavior changed.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
Peter Zijlstra bfe35965ec f2fs, lockdep: annotate mutex_lock_all()
Majianpeng reported a lockdep splat for f2fs. It turns out mutex_lock_all()
acquires an array of locks (in global/local lock style).

Any such operation is always serialized using cp_mutex, therefore there is no
fs_lock[] lock-order issue; tell lockdep about this using the
mutex_lock_nest_lock() primitive.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
Jaegeuk Kim f356fe0cba f2fs: add debug msgs in the recovery routine
This patch adds some trivial debugging messages in the recovery process.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
Jaegeuk Kim 44a83ff6a8 f2fs: update inode page after creation
I found a bug when testing power-off-recovery as follows.

[Bug Scenario]
1. create a file
2. fsync the file
3. reboot w/o any sync
4. try to recover the file
 - found its fsync mark
 - found its dentry mark
   : try to recover its dentry
    - get its file name
    - get its parent inode number
     : here we got zero value

The reason why we get the wrong parent inode number is that we didn't
synchronize the inode page with its newly created inode information perfectly.

Especially, previous f2fs stores fi->i_pino and writes it to the cached
node page in a wrong order, which incurs the zero-valued i_pino during the
recovery.

So, this patch modifies the creation flow to fix the synchronization order of
inode page with its inode.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:02 +09:00
Jaegeuk Kim 64aa7ed98d f2fs: change get_new_data_page to pass a locked node page
This patch is for passing a locked node page to get_dnode_of_data.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim 1646cfac95 f2fs: skip get_node_page if locked node page is passed
If get_dnode_of_data gets a locked node page, let's skip redundant
get_node_page calls.
This is for the futher enhancement.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim 0a364af18f f2fs: remove unnecessary por_doing check
This por_doing check is totally not related to the recovery process.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim 74d0b917ef f2fs: fix BUG_ON during f2fs_evict_inode(dir)
During the dentry recovery routine, recover_inode() triggers __f2fs_add_link
with its directory inode.

In the following scenario, a bug is captured.
 1. dir = f2fs_iget(pino)
 2. __f2fs_add_link(dir, name)
 3. iput(dir)
  -> f2fs_evict_inode() faces with BUG_ON(atomic_read(fi->dirty_dents))

Kernel BUG at ffffffffa01c0676 [verbose debug info unavailable]
[<ffffffffa01c0676>] f2fs_evict_inode+0x276/0x300 [f2fs]
Call Trace:
 [<ffffffff8118ea00>] evict+0xb0/0x1b0
 [<ffffffff8118f1c5>] iput+0x105/0x190
 [<ffffffffa01d2dac>] recover_fsync_data+0x3bc/0x1070 [f2fs]
 [<ffffffff81692e8a>] ? io_schedule+0xaa/0xd0
 [<ffffffff81690acb>] ? __wait_on_bit_lock+0x7b/0xc0
 [<ffffffff8111a0e7>] ? __lock_page+0x67/0x70
 [<ffffffff81165e21>] ? kmem_cache_alloc+0x31/0x140
 [<ffffffff8118a502>] ? __d_instantiate+0x92/0xf0
 [<ffffffff812a949b>] ? security_d_instantiate+0x1b/0x30
 [<ffffffff8118a5b4>] ? d_instantiate+0x54/0x70

This means that we should flush all the dentry pages between iget and iput().
But, during the recovery routine, it is unallowed due to consistency, so we
have to wait the whole recovery process.
And then, write_checkpoint flushes all the dirty dentry blocks, and nicely we
can put the stale dir inodes from the dirty_dir_inode_list.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim 8c26d7d571 f2fs: fix por_doing variable coverage
The reason of using sbi->por_doing is to alleviate data writes during the
recovery.
The find_fsync_dnodes() produces some dirty dentry pages, so we should
cover it too with sbi->por_doing.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim addbe45b00 f2fs: remove redundant assignment
We don't need to assign a value redundantly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:01 +09:00
Jaegeuk Kim 650495dedc f2fs: fix the inconsistent state of data pages
In get_lock_data_page, if there is a data race between get_dnode_of_data for
node and grab_cache_page for data, f2fs is able to face with the following
BUG_ON(dn.data_blkaddr == NEW_ADDR).

kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/data.c:251!
 [<ffffffffa044966c>] get_lock_data_page+0x1ec/0x210 [f2fs]
Call Trace:
 [<ffffffffa043b089>] f2fs_readdir+0x89/0x210 [f2fs]
 [<ffffffff811a0920>] ? fillonedir+0x100/0x100
 [<ffffffff811a0920>] ? fillonedir+0x100/0x100
 [<ffffffff811a07f8>] vfs_readdir+0xb8/0xe0
 [<ffffffff811a0b4f>] sys_getdents+0x8f/0x110
 [<ffffffff816d7999>] system_call_fastpath+0x16/0x1b

This bug is able to be occurred when the block address of the data block is
changed after f2fs_put_dnode().
In order to avoid that, this patch fixes the lock order of node and data
blocks in which the node block lock is covered by the data block lock.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:00 +09:00
Jaegeuk Kim 65e5cd0a15 f2fs: fix inconsistency of block count during recovery
Currently f2fs recovers the dentry of fsynced files.
When power-off-recovery is conducted, this newly recovered inode should increase
node block count as well as inode block count.

This patch resolves this inconsistency that results in:

1. create a file
2. write data
3. fsync
4. reboot without sync
5. mount and recover the file
6. node block count is 1 and inode block count is 2
 : fall into the inconsistent state
7. unlink the file
 : trigger the following BUG_ON

------------[ cut here ]------------
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/f2fs.h:716!
Call Trace:
 [<ffffffffa0344100>] ? get_node_page+0x50/0x1a0 [f2fs]
 [<ffffffffa0344bfc>] remove_inode_page+0x8c/0x100 [f2fs]
 [<ffffffffa03380f0>] ? f2fs_evict_inode+0x180/0x2d0 [f2fs]
 [<ffffffffa033812e>] f2fs_evict_inode+0x1be/0x2d0 [f2fs]
 [<ffffffff811c7a67>] evict+0xa7/0x1a0
 [<ffffffff811c82b5>] iput+0x105/0x190
 [<ffffffff811c2b30>] d_kill+0xe0/0x120
 [<ffffffff811c2c57>] dput+0xe7/0x1e0
 [<ffffffff811acc3d>] __fput+0x19d/0x2d0
 [<ffffffff811acd7e>] ____fput+0xe/0x10
 [<ffffffff81070645>] task_work_run+0xb5/0xe0
 [<ffffffff81002941>] do_notify_resume+0x71/0xb0
 [<ffffffff8175f14a>] int_signal+0x12/0x17

Reported-and-Tested-by: Chris Fries <C.Fries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-28 15:03:00 +09:00
Lukas Czerner d23142c627 ext4: make punch hole code path work with bigalloc
Currently punch hole is disabled in file systems with bigalloc
feature enabled. However the recent changes in punch hole patch should
make it easier to support punching holes on bigalloc enabled file
systems.

This commit changes partial_cluster handling in ext4_remove_blocks(),
ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
partial_cluster is unsigned long long type and it makes sure that we
will free the partial cluster if all extents has been released from that
cluster. However it has been specifically designed only for truncate.

With punch hole we can be freeing just some extents in the cluster
leaving the rest untouched. So we have to make sure that we will notice
cluster which still has some extents. To do this I've changed
partial_cluster to be signed long long type. The only scenario where
this could be a problem is when cluster_size == block size, however in
that case there would not be any partial clusters so we're safe. For
bigger clusters the signed type is enough. Now we use the negative value
in partial_cluster to mark such cluster used, hence we know that we must
not free it even if all other extents has been freed from such cluster.

This scenario can be described in simple diagram:

|FFF...FF..FF.UUU|
 ^----------^
  punch hole

. - free space
| - cluster boundary
F - freed extent
U - used extent

Also update respective tracepoints to use signed long long type for
partial_cluster.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:33:35 -04:00
Lukas Czerner 61801325f7 ext4: update ext4_ext_remove_space trace point
Add "end" variable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner 78fb9cdf03 ext4: remove unused code from ext4_remove_blocks()
The "head removal" branch in the condition is never used in any code
path in ext4 since the function only caller ext4_ext_rm_leaf() will make
sure that the extent is properly split before removing blocks. Note that
there is a bug in this branch anyway.

This commit removes the unused code completely and makes use of
ext4_error() instead of printk if dubious range is provided.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner c121ffd013 ext4: remove unused discard_partial_page_buffers
The discard_partial_page_buffers is no longer used anywhere so we can
simply remove it including the *_no_lock variant and
EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner a87dd18ce2 ext4: use ext4_zero_partial_blocks in punch_hole
We're doing to get rid of ext4_discard_partial_page_buffers() since it is
duplicating some code and also partially duplicating work of
truncate_pagecache_range(), moreover the old implementation was much
clearer.

Now when the truncate_inode_pages_range() can handle truncating non page
aligned regions we can use this to invalidate and zero out block aligned
region of the punched out range and then use ext4_block_truncate_page()
to zero the unaligned blocks on the start and end of the range. This
will greatly simplify the punch hole code. Moreover after this commit we
can get rid of the ext4_discard_partial_page_buffers() completely.

We also introduce function ext4_prepare_punch_hole() to do come common
operations before we attempt to do the actual punch hole on
indirect or extent file which saves us some code duplication.

This has been tested on ppc64 with 1k block size with fsx and xfstests
without any problems.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner 55f252c9f5 ext4: truncate_inode_pages() in orphan cleanup path
Currently we do not tell mm to zero out tail of the page before truncate
in orphan_cleanup(). This is ok, because the page should not be
uptodate, however this may eventually change and I might cause problems.

Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara
for pointing this out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner eb3544c6fc Revert "ext4: fix fsx truncate failure"
This reverts commit 189e868fa8.

This commit reintroduces the use of ext4_block_truncate_page() in ext4
truncate operation instead of ext4_discard_partial_page_buffers().

The statement in the commit description that the truncate operation only
zero block unaligned portion of the last page is not exactly right,
since truncate_pagecache_range() also zeroes and invalidate the unaligned
portion of the page. Then there is no need to zero and unmap it once more
and ext4_block_truncate_page() was doing the right job, although we
still need to update the buffer head containing the last block, which is
exactly what ext4_block_truncate_page() is doing.

Moreover the problem described in the commit is fixed more properly with
commit

15291164b2
	jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer

This was tested on ppc64 machine with block size of 1024 bytes without
any problems.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner 0713ed0cde ext4: Call ext4_jbd2_file_inode() after zeroing block
In data=ordered mode we should call ext4_jbd2_file_inode() so that crash
after the truncate transaction has committed does not expose stall data
in the tail of the block.

Thanks Jan Kara for pointing that out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Lukas Czerner d863dc3614 Revert "ext4: remove no longer used functions in inode.c"
This reverts commit ccb4d7af91.

This commit reintroduces functions ext4_block_truncate_page() and
ext4_block_zero_page_range() which has been previously removed in favour
of ext4_discard_partial_page_buffers().

In future commits we want to reintroduce those function and remove
ext4_discard_partial_page_buffers() since it is duplicating some code
and also partially duplicating work of truncate_pagecache_range(),
moreover the old implementation was much clearer.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Greg Kroah-Hartman 4993632218 Merge 3.10-rc3 into driver-core-next
We want the changes here.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-27 10:42:53 +09:00
Linus Torvalds 89ff77837a NFS client bugfixes for 3.10
- Stable fix to prevent an rpc_task wakeup race
 - Fix a NFSv4.1 session drain deadlock
 - Fix a NFSv4/v4.1 mount regression when not running rpc.gssd
 - Ensure auth_gss pipe detection works in namespaces
 - Fix SETCLIENTID fallback if rpcsec_gss is not available
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRomGNAAoJEGcL54qWCgDyzUIQALj4hsOtdcmCCAWu0m+Pr+Gz
 /6rpO2xJ5cLWyTYS2J9wmPkU+mb/g/OB/OhANyQQsDbfoW3e27TGaXRB8hUs29vk
 LHTkFcrBTi4ohlw2Gyb6LWDF6cyHkC7cpH7tfIjLDff77V/qK1uqo41MtYpqUvy3
 41tMoFOhvpwsy7HuKqVyYtNlwxnXrdJ5XvK0ycaQYQ9t8om9Hxu/scCgLfl/qwRr
 eVhIesC2/Y1eC46UK7/TWCC7aTLkvi85UQY+fywMkajt/gPzjjh6nuhc45aOT9d7
 pc0sXgIs8fLEjC+CRa0ojrziJZCHZY93U1kzA+CotRqPq76nPeFeG/1VhViKKb4C
 1AU9IA4ntViUqNDQiGrCxE6ZeMewTOKksrCErwSS16Hv9Gqh4SHq84R/bF6MFgBc
 dqQsexUbf/vaWHmuosARgbyc/QcBXDwWwbkq0hXSWbHA8vpc8/Lw1wkp49eyKz0V
 0zwJ9xInakXr5/DIC72IKEF0Dg26L+GTXWLmDZVcdpVBp5A403trxUKckcsNa/Xf
 FXaGMhyv2ZfV4AohW1Z8klkLiMHt4dr+YB7SQAgR/gb81p3odgKgMz51PfmHxFqs
 4nxwRQqWplBTGiLrOFSgaRC50jLqEloUweWC2qf56KYEb9N6M6XDIXhllLIMyWLD
 94cvihpKj+/ecKH5kQWF
 =B8go
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Stable fix to prevent an rpc_task wakeup race
 - Fix a NFSv4.1 session drain deadlock
 - Fix a NFSv4/v4.1 mount regression when not running rpc.gssd
 - Ensure auth_gss pipe detection works in namespaces
 - Fix SETCLIENTID fallback if rpcsec_gss is not available

* tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix SETCLIENTID fallback if GSS is not available
  SUNRPC: Prevent an rpc_task wakeup race
  NFSv4.1 Fix a pNFS session draining deadlock
  SUNRPC: Convert auth_gss pipe detection to work in namespaces
  SUNRPC: Faster detection if gssd is actually running
  SUNRPC: Fix a bug in gss_create_upcall
2013-05-26 12:33:05 -07:00
Linus Torvalds 088d812fe9 xfs: fixes for 3.10-rc3
- Fix for corruption with FSX on 512 byte blocksize filesystems
 - Fix rounding error in xfs_free_file_space
 - Fix use-after-free with extent free intents
 - Add several missing KM_NOFS flags to fix lockdep reports
 - Several fixes for CRC related code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRn93mAAoJENaLyazVq6ZOPw8P/1ab7Gr4K7ccpuRvC+JiBPx9
 /Eb1l2A68r8rWJq3bEvLM/XgN7F0IESastS+iA8lrNuLWJAHCb+Y37mdbyWymhWO
 J6LFg/Mic+n5uGmg6nXrqLJ8epTd3L+V+cCEHLpa/20zEwa5yd8sSTV0P1pfcMRi
 S36Uof7boe+s4H/s2z4YX8Y+dkaVn1tN7j4PDs5Zd/wiUq0EKA5le4AkbqIJlCDn
 E9mJL1S2t3QwfgQWL+pe8f5jzReGTAxJUQWX2ErajX4vYI7PnR1WTYf7YJB291f0
 XPfkjqvaK0WCdipSkgb58t/Rj2IxoHItdryvBHa/rnL4e95aBQMpNyY8btNX28ou
 NgcIQnqQB+xlJeHpSyb1xNusamGzWw0CEYsXPcDQbXj1uJjd7/GfqTwtVbB4zOlW
 Hua7/6N22vaY1dfQcQFwgi1XcKF9jSVKBVOvy7yHEmdNuT7mFejaG1gVO7NpTIgd
 s7R91pQVlYpLTmZHK6ZqZap5OC6kS/YJr9HxVxS3FQhaJmFwGqvf6UUjSWOK1MnJ
 obTbbygo2Orvh10lgzDNsd+xRJZ9aFn+UfywSeQGTfm91HvTnEQje3xqctv3zmBy
 adrWSXXg/eN07nciNk0zTgHnQZo1v/ZeAFlz0AcSslMMQ7Pq4+Pv4LPtdxMIdrxU
 1IZ32GunGTUXqm6GRW6v
 =hQCM
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Ben Myers:
 "Here are fixes for corruption on 512 byte filesystems, a rounding
  error, a use-after-free, some flags to fix lockdep reports, and
  several fixes related to CRCs.  We have a somewhat larger post -rc1
  queue than usual due to fixes related to the CRC feature we merged for
  3.10:

   - Fix for corruption with FSX on 512 byte blocksize filesystems
   - Fix rounding error in xfs_free_file_space
   - Fix use-after-free with extent free intents
   - Add several missing KM_NOFS flags to fix lockdep reports
   - Several fixes for CRC related code"

* tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs:
  xfs: remote attribute lookups require the value length
  xfs: xfs_attr_shortform_allfit() does not handle attr3 format.
  xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
  xfs: fix missing KM_NOFS tags to keep lockdep happy
  xfs: Don't reference the EFI after it is freed
  xfs: fix rounding in xfs_free_file_space
  xfs: fix sub-page blocksize data integrity writes
2013-05-26 09:35:02 -07:00
Benjamin LaHaise 03e04f048d aio: fix kioctx not being freed after cancellation at exit time
The recent changes overhauling fs/aio.c introduced a bug that results in
the kioctx not being freed when outstanding kiocbs are cancelled at
exit_aio() time.  Specifically, a kiocb that is cancelled has its
completion events discarded by batch_complete_aio(), which then fails to
wake up the process stuck in free_ioctx().  Fix this by modifying the
wait_event() condition in free_ioctx() appropriately.

This patch was tested with the cancel operation in the thread based code
posted yesterday.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:53 -07:00
Joseph Qi b4ca2b4b57 ocfs2: goto out_unlock if ocfs2_get_clusters_nocache() failed in ocfs2_fiemap()
Last time we found there is lock/unlock bug in ocfs2_file_aio_write, and
then we did a thorough search for all lock resources in
ocfs2_inode_info, including rw, inode and open lockres and found this
bug.  My kernel version is 3.0.13, and it is also in the lastest version
3.9.  In ocfs2_fiemap, once ocfs2_get_clusters_nocache failed, it should
goto out_unlock instead of out, because we need release buffer head, up
read alloc sem and unlock inode.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Ryusuke Konishi 136e8770cd nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary
nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary

DESCRIPTION:
 There are use-cases when NILFS2 file system (formatted with block size
lesser than 4 KB) can be remounted in RO mode because of encountering of
"broken bmap" issue.

The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
 "The machine I've been trialling nilfs on is running Debian Testing,
  Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
  version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
  also reproduced it (identically) with Debian Unstable amd64 and Debian
  Experimental (using the 3.8-trunk kernel).  The problematic partitions
  were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."

SYMPTOMS:
(1) System log contains error messages likewise:

    [63102.496756] nilfs_direct_assign: invalid pointer: 0
    [63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
    [63102.496798]
    [63102.524403] Remounting filesystem read-only

(2) The NILFS2 file system is remounted in RO mode.

REPRODUSING PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):

----------------[BEGIN SCRIPT]--------------------

VG=unencrypted
lvcreate --size 2G --name ntest $VG
mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
mkdir /var/tmp/n
mkdir /var/tmp/n/ntest
mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
mkdir /var/tmp/n/ntest/thedir
cd /var/tmp/n/ntest/thedir
sleep 2
date
darcs init
sleep 2
dmesg|tail -n 5
date
darcs whatsnew || true
date
sleep 2
dmesg|tail -n 5
----------------[END SCRIPT]--------------------

REPRODUCIBILITY: 100%

INVESTIGATION:
As it was discovered, the issue takes place during segment
construction after executing such sequence of user-space operations:

  open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
  fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  ftruncate(7, 60)

The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
bmap (inode number=28)" takes place because of trying to get block
number for third block of the file with logical offset #3072 bytes.  As
it is possible to see from above output, the file has 60 bytes of the
whole size.  So, it is enough one block (1 KB in size) allocation for
the whole file.  Trying to operate with several blocks instead of one
takes place because of discovering several dirty buffers for this file
in nilfs_segctor_scan_file() method.

The root cause of this issue is in nilfs_set_page_dirty function which
is called just before writing to an mmapped page.

When nilfs_page_mkwrite function handles a page at EOF boundary, it
fills hole blocks only inside EOF through __block_page_mkwrite().

The __block_page_mkwrite() function calls set_page_dirty() after filling
hole blocks, thus nilfs_set_page_dirty function (=
a_ops->set_page_dirty) is called.  However, the current implementation
of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
at EOF boundary.

As a result, buffers outside EOF are inconsistently marked dirty and
queued for write even though they are not mapped with nilfs_get_block
function.

FIX:
This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.

Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
for this issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Anthony Doggett <Anthony2486@interfaces.org.uk>
Reported-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Jeff Moyer 6900807c6b aio: fix io_getevents documentation
In reviewing man pages, I noticed that io_getevents is documented to
update the timeout that gets passed into the library call.  This doesn't
happen in kernel space or in the library (even though it's documented to
do so in both places).  Unless there is objection, I'd like to fix the
comments/docs to match the code (I will also update the man page upon
consensus).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:52 -07:00
Jeff Mahoney fb09c3733a hfs: avoid crash in hfs_bnode_create
Commit 634725a929 ("hfs: cleanup HFS+ prints") removed the BUG_ON in
hfs_bnode_create in hfsplus.  This patch removes it from the hfs version
and avoids an fsfuzzer crash.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:51 -07:00
Joseph Qi afe1bb73f8 ocfs2: unlock rw lock if inode lock failed
In ocfs2_file_aio_write(), it does ocfs2_rw_lock() first and then
ocfs2_inode_lock().

But if ocfs2_inode_lock() failed, it goes to out_sems without unlocking
rw lock.  This will cause a bug in ocfs2_lock_res_free() when testing
res->l_ex_holders, which is increased in __ocfs2_cluster_lock() and
decreased in __ocfs2_cluster_unlock().

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: "Duyongfeng (B)" <du.duyongfeng@huawei.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:51 -07:00
OGAWA Hirofumi 7b92d03c32 fat: fix possible overflow for fat_clusters
Intermediate value of fat_clusters can be overflowed on 32bits arch.

Reported-by: Krzysztof Strasburger <strasbur@chkw386.ch.pwr.wroc.pl>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:50 -07:00
Paul Taysom c15cddd900 ecryptfs: fixed msync to flush data
When msync is called on a memory mapped file, that
data is not flushed to the disk.

In Linux, msync calls fsync for the file. For ecryptfs,
fsync just calls the lower level file system's fsync.
Changed the ecryptfs fsync code to call filemap_write_and_wait
before calling the lower level fsync.

Addresses the problem described in http://crbug.com/239536

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@vger.kernel.org # v3.6+
2013-05-24 16:21:45 -07:00
Dave Chinner 7ae077802c xfs: remote attribute lookups require the value length
When reading a remote attribute, to correctly calculate the length
of the data buffer for CRC enable filesystems, we need to know the
length of the attribute data. We get this information when we look
up the attribute, but we don't store it in the args structure along
with the other remote attr information we get from the lookup. Add
this information to the args structure so we can use it
appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit e461fcb194)
2013-05-24 16:31:20 -05:00
Dave Chinner cf257abf02 xfs: xfs_attr_shortform_allfit() does not handle attr3 format.
xfstests generic/117 fails with:

XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC)

indicating a function that does not handle the attr3 format
correctly. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit b38958d715)
2013-05-24 16:29:56 -05:00
Dave Chinner 7ced60cae4 xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 72916fb8cb)
2013-05-24 16:29:37 -05:00
Dave Chinner b17cb364db xfs: fix missing KM_NOFS tags to keep lockdep happy
There are several places where we use KM_SLEEP allocation contexts
and use the fact that they are called from transaction context to
add KM_NOFS where appropriate. Unfortunately, there are several
places where the code makes this assumption but can be called from
outside transaction context but with filesystem locks held. These
places need explicit KM_NOFS annotations to avoid lockdep
complaining about reclaim contexts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit ac14876cf9)
2013-05-24 16:29:15 -05:00
Dave Chinner 509e708a89 xfs: Don't reference the EFI after it is freed
Checking the EFI for whether it is being released from recovery
after we've already released the known active reference is a mistake
worthy of a brown paper bag. Fix the (now) obvious use after free
that it can cause.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 52c24ad39f)
2013-05-24 16:27:57 -05:00
Dave Chinner 7031d0e1c4 xfs: fix rounding in xfs_free_file_space
The offset passed into xfs_free_file_space() needs to be rounded
down to a certain size, but the rounding mask is built by a 32 bit
variable. Hence the mask will always mask off the upper 32 bits of
the offset and lead to incorrect writeback and invalidation ranges.

This is not actually exposed as a bug because we writeback and
invalidate from the rounded offset to the end of the file, and hence
the offset we are actually punching a hole out of will always be
covered by the code. This needs fixing, however, if we ever want to
use exact ranges for writeback/invalidation here...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 28ca489c63)
2013-05-24 16:27:41 -05:00
Dave Chinner 480d7467e4 xfs: fix sub-page blocksize data integrity writes
FSX on 512 byte block size filesystems has been failing for some
time with corrupted data. The fault dates back to the change in
the writeback data integrity algorithm that uses a mark-and-sweep
approach to avoid data writeback livelocks.

Unfortunately, a side effect of this mark-and-sweep approach is that
each page will only be written once for a data integrity sync, and
there is a condition in writeback in XFS where a page may require
two writeback attempts to be fully written. As a result of the high
level change, we now only get a partial page writeback during the
integrity sync because the first pass through writeback clears the
mark left on the page index to tell writeback that the page needs
writeback....

The cause is writing a partial page in the clustering code. This can
happen when a mapping boundary falls in the middle of a page - we
end up writing back the first part of the page that the mapping
covers, but then never revisit the page to have the remainder mapped
and written.

The fix is simple - if the mapping boundary falls inside a page,
then simple abort clustering without touching the page. This means
that the next ->writepage entry that write_cache_pages() will make
is the page we aborted on, and xfs_vm_writepage() will map all
sections of the page correctly. This behaviour is also optimal for
non-data integrity writes, as it results in contiguous sequential
writeback of the file rather than missing small holes and having to
write them a "random" writes in a future pass.

With this fix, all the fsx tests in xfstests now pass on a 512 byte
block size filesystem on a 4k page machine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 49b137cbbc)
2013-05-24 16:26:51 -05:00
Gu Zheng 95bbb82f60 fs/jfs: Add check if journaling to disk has been disabled in lbmRead()
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-05-24 16:03:47 -05:00
Vahram Martirosyan e9b3766719 jfs: Several bugs in jfs_freeze() and jfs_unfreeze()
The mentioned functions do not pay attention to the error codes returned
by the functions updateSuper(), lmLogInit() and lmLogShutdown(). It brings
to system crash later when writing to log.

The patch adds corresponding code to check and return the error codes
and to print correct error messages in case of errors.

Found by Linux File System Verification project (linuxtesting.org).

Signed-off-by: Vahram Martirosyan <vahram.martirosyan@linuxtesting.org>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2013-05-24 16:03:42 -05:00
Jeff Layton d9deef0a3f cifs: fix composing of mount options for DFS referrals
With the change to ignore the unc= and prefixpath= mount options, there
is no longer any need to add them to the options string when mounting.
By the same token, we now need to build a device name that includes the
prefixpath when mounting.

To make things neater, the delimiters on the devicename are changed
to '/' since that's preferred when mounting anyway.

v2: fix some comments and don't bother looking at whether there is
    a prepath in the ref->node_name when deciding whether to pass
    a prepath to cifs_build_devname.

v3: rebase on top of potential buffer overrun fix for stable

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 13:08:31 -05:00
Jeff Layton 9c9c29e1af cifs: stop printing the unc= option in /proc/mounts
Since we no longer recognize that option, stop printing it out. The
devicename is now the canonical source for this info.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 13:08:29 -05:00
Jeff Layton 37d4f99b55 cifs: fix error handling when calling cifs_parse_devname
When we allowed separate unc= and prefixpath= mount options, we could
ignore EINVAL errors from cifs_parse_devname. Now that they are
deprecated, we need to check for that as well and fail the mount if it's
malformed.

Also fix a later error message that refers to the unc= option.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 13:08:28 -05:00
Jeff Layton 539673fff7 cifs: allow sec=none mounts to work against servers that don't support extended security
In the case of sec=none, we're not sending a username or password, so
there's little benefit to mandating NTLMSSP auth. Allow it to use
unencapsulated auth in that case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 13:08:26 -05:00
Jeff Layton 166faf21bd cifs: fix potential buffer overrun when composing a new options string
Consider the case where we have a very short ip= string in the original
mount options, and when we chase a referral we end up with a very long
IPv6 address. Be sure to allow for that possibility when estimating the
size of the string to allocate.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 13:08:19 -05:00
Jeff Layton 62106e9627 cifs: only set ops for inodes in I_NEW state
It's generally not safe to reset the inode ops once they've been set. In
the case where the inode was originally thought to be a directory and
then later found to be a DFS referral, this can lead to an oops when we
try to trigger an inode op on it after changing the ops to the blank
referral operations.

Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-05-24 12:55:39 -05:00
Linus Torvalds a8432588fc Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fix from Steve French:
 "One cifs fix to merge now - fixes possible DFS oops (I expect to
  request a merge of 4 additional cifs fixes next week)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: only set ops for inodes in I_NEW state
2013-05-24 10:45:59 -07:00
Steven Whitehouse e97e548ba8 GFS2: Fix typo in gfs2_log_end_write loop
There was a missing _all in this loop iterator

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24 13:48:09 +01:00
Randy Dunlap 75f96ce6e7 GFS2: fix DLM depends to fix build errors
Fix build errors by correcting DLM dependencies in GFS2.
Build errors happen when CONFIG_GFS2_FS_LOCKING_DLM=y and CONFIG_DLM=m:

fs/built-in.o: In function `gfs2_lock':
file.c:(.text+0xc7abd): undefined reference to `dlm_posix_get'
file.c:(.text+0xc7ad0): undefined reference to `dlm_posix_unlock'
file.c:(.text+0xc7ad9): undefined reference to `dlm_posix_lock'
fs/built-in.o: In function `gdlm_unmount':
lock_dlm.c:(.text+0xd6e5b): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `sync_unlock':
lock_dlm.c:(.text+0xd6e9e): undefined reference to `dlm_unlock'
fs/built-in.o: In function `sync_lock':
lock_dlm.c:(.text+0xd6fb6): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_put_lock':
lock_dlm.c:(.text+0xd7238): undefined reference to `dlm_unlock'
fs/built-in.o: In function `gdlm_mount':
lock_dlm.c:(.text+0xd753e): undefined reference to `dlm_new_lockspace'
lock_dlm.c:(.text+0xd79d3): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `gdlm_lock':
lock_dlm.c:(.text+0xd8179): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_cancel':
lock_dlm.c:(.text+0xd6b22): undefined reference to `dlm_unlock'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24 13:47:53 +01:00
Bob Peterson af21ca8ed5 GFS2: Use single-block reservations for directories
This patch changes the multi-block allocation code, such that
directory inodes only get a single block reserved in the bitmap.
That way, the bitmaps are more tightly packed together, and there
are fewer spans of free blocks for in-use block reservations.
This means it takes less time to find a free span of blocks in the
bitmap, which speeds things up. This increases the performance of
some workloads by almost 2X. In Nate's mockup.py script (which does
(1) create dir, (2) create dir in dir, (3) create file in that dir)
the test executes in 23 steps rather than 43 steps, a 47%
performance improvement.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24 13:47:32 +01:00
Bob Peterson 37f715774e GFS2: two minor quota fixups
This patch fixes two regression problems that Abhi found in the
GFS2 quota code.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-05-24 13:47:13 +01:00
Dave Chinner ad1858d777 xfs: rework remote attr CRCs
Note: this changes the on-disk remote attribute format. I assert
that this is OK to do as CRCs are marked experimental and the first
kernel it is included in has not yet reached release yet. Further,
the userspace utilities are still evolving and so anyone using this
stuff right now is a developer or tester using volatile filesystems
for testing this feature. Hence changing the format right now to
save longer term pain is the right thing to do.

The fundamental change is to move from a header per extent in the
attribute to a header per filesytem block in the attribute. This
means there are more header blocks and the parsing of the attribute
data is slightly more complex, but it has the advantage that we
always know the size of the attribute on disk based on the length of
the data it contains.

This is where the header-per-extent method has problems. We don't
know the size of the attribute on disk without first knowing how
many extents are used to hold it. And we can't tell from a
mapping lookup, either, because remote attributes can be allocated
contiguously with other attribute blocks and so there is no obvious
way of determining the actual size of the atribute on disk short of
walking and mapping buffers.

The problem with this approach is that if we map a buffer
incorrectly (e.g. we make the last buffer for the attribute data too
long), we then get buffer cache lookup failure when we map it
correctly. i.e. we get a size mismatch on lookup. This is not
necessarily fatal, but it's a cache coherency problem that can lead
to returning the wrong data to userspace or writing the wrong data
to disk. And debug kernels will assert fail if this occurs.

I found lots of niggly little problems trying to fix this issue on a
4k block size filesystem, finally getting it to pass with lots of
fixes. The thing is, 1024 byte filesystems still failed, and it was
getting really complex handling all the corner cases that were
showing up. And there were clearly more that I hadn't found yet.

It is complex, fragile code, and if we don't fix it now, it will be
complex, fragile code forever more.

Hence the simple fix is to add a header to each filesystem block.
This gives us the same relationship between the attribute data
length and the number of blocks on disk as we have without CRCs -
it's a linear mapping and doesn't require us to guess anything. It
is simple to implement, too - the remote block count calculated at
lookup time can be used by the remote attribute set/get/remove code
without modification for both CRC and non-CRC filesystems. The world
becomes sane again.

Because the copy-in and copy-out now need to iterate over each
filesystem block, I moved them into helper functions so we separate
the block mapping and buffer manupulations from the attribute data
and CRC header manipulations. The code becomes much clearer as a
result, and it is a lot easier to understand and debug. It also
appears to be much more robust - once it worked on 4k block size
filesystems, it has worked without failure on 1k block size
filesystems, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 18:04:06 -05:00
Dave Chinner d4c712bcf2 xfs: fully initialise temp leaf in xfs_attr3_leaf_compact
xfs_attr3_leaf_compact() uses a temporary buffer for compacting the
the entries in a leaf. It copies the the original buffer into the
temporary buffer, then zeros the original buffer completely. It then
copies the entries back into the original buffer.  However, the
original buffer has not been correctly initialised, and so the
movement of the entries goes horribly wrong.

Make sure the zeroed destination buffer is fully initialised, and
once we've set up the destination incore header appropriately, write
is back to the buffer before starting to move entries around.

While debugging this, the _d/_s prefixes weren't sufficient to
remind me what buffer was what, so rename then all _src/_dst.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 17:53:08 -05:00
Dave Chinner 8517de2a81 xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance
xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining
the entries in two leaves when the destination leaf requires
compaction. The temporary buffer ends up being copied back over the
original destination buffer, so the header in the temporary buffer
needs to contain all the information that is in the destination
buffer.

To make sure the temporary buffer is fully initialised, once we've
set up the temporary incore header appropriately, write is back to
the temporary buffer before starting to move entries around.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 17:52:07 -05:00
Chuck Lever 83c168bf80 NFS: Fix SETCLIENTID fallback if GSS is not available
Commit 79d852bf "NFS: Retry SETCLIENTID with AUTH_SYS instead of
AUTH_NONE" did not take into account commit 23631227 "NFSv4: Fix the
fallback to AUTH_NULL if krb5i is not available".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-23 18:50:40 -04:00
Dave Chinner 6863ef8449 xfs: correctly map remote attr buffers during removal
If we don't map the buffers correctly (same as for get/set
operations) then the incore buffer lookup will fail. If a block
number matches but a length is wrong, then debug kernels will ASSERT
fail in _xfs_buf_find() due to the length mismatch. Ensure that we
map the buffers correctly by basing the length of the buffer on the
attribute data length rather than the remote block count.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 17:49:28 -05:00
Dave Chinner 4af3644c9a xfs: remote attribute tail zeroing does too much
When an attribute data does not fill then entire remote block, we
zero the remaining part of the buffer. This, however, needs to take
into account that the buffer has a header, and so the offset where
zeroing starts and the length of zeroing need to take this into
account. Otherwise we end up with zeros over the end of the
attribute value when CRCs are enabled.

While there, make sure we only ask to map an extent that covers the
remaining range of the attribute, rather than asking every time for
the full length of remote data. If the remote attribute blocks are
contiguous with other parts of the attribute tree, it will map those
blocks as well and we can potentially zero them incorrectly. We can
also get buffer size mistmatches when trying to read or remove the
remote attribute, and this can lead to not finding the correct
buffer when looking it up in cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 17:35:18 -05:00
Dave Chinner 913e96bc29 xfs: remote attribute read too short
Reading a maximally size remote attribute fails when CRCs are
enabled with this verification error:

XFS (vdb): remote attribute header does not match required off/len/owner)

There are two reasons for this, the first being that the
length of the buffer being read is determined from the
args->rmtblkcnt which doesn't take into account CRC headers. Hence
the mapped length ends up being too short and so we need to
calculate it directly from the value length.

The second is that the byte count of valid data within a buffer is
capped by the length of the data and so doesn't take into account
that the buffer might be longer due to headers. Hence we need to
calculate the data space in the buffer first before calculating the
actual byte count of data.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-23 17:31:20 -05:00
H. Peter Anvin 3979fdce77 * Avoid confusing the user by returning -EIO instead of -ENOENT in
efivarfs if an EFI variable gets deleted from under us and return EOF
    when reading from a zero-length file - Lingzhu Xiang
 
  * Fix an oops in efivar_update_sysfs_entries() caused by reusing (and
    therefore corrupting) a kzalloc() allocation - Seiji Aguchi
 
  * Initialise the DataSize argument to GetVariable() otherwise it will
    not be updated with the actual size of the variable on return.
    Discovered on a Acer Aspire V3 BIOS - Lee, Chun-Yi
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRlgJ5AAoJEC84WcCNIz1VcRgQAJWnQpuI+Y1RjPYuLb8ISZ92
 jKnVSAQPaWDccf/KKx7qtk2iK6G1mExomOggp6PpV4/uHWqkQ64qoGL0HmRbmdH7
 //gNXm+ITnSVojoTBBis2LYEYDUQ1aDp3vTevFT2OOMKzKRsgwFD5KbqLnPJCxnl
 kFHlxQSPa7ZYk6GXOVdBJ5ya7spU+Lk4ODiL4xM8EBkMBedUStedB5zNg5rEMLJq
 5FZfs2Ryg7riOkhBnKWCvFjAj3PO/yHLVdrVrGZx+KpATHr/rbsbzDEp3+Pk8zEu
 3aE1p/XLaB4Qk9WvwhWonWF7PSZP41SjrcEt77mTmYivk0cJHBORo/wJsom683BU
 4elWR588Pi92nKEx4MSLREhkzP5RdVFIb4Yrg+1xKLEfI5Kfj1nEQrFVk53VM/rL
 L1m1bwhKLaNUVFJmxJYHaZ6VTjcB6lupz9/AMQr1N1UAavlmr4xEveNTfto+Ys7u
 /J3Z04wouhOGQonVDrZxxOzWFewq1kR/hRfwBAf/V0qemAeS61ZHGpfmmq00y5L2
 jmXncQ5EspFIJfk7+KV2sPQfqC8U8b2jyfO36Ed9u3d/qQ6eiQ1rTZMLKGg1iSdW
 Z+94w7+huEdsNZShSUE8+HXNvzPCS6704PBa9YT292ZibWM9onk6AoBQy/snrEaO
 K+PKxkZyPBML77Vs4TSI
 =8+a3
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' into x86/urgent

 * Avoid confusing the user by returning -EIO instead of -ENOENT in
   efivarfs if an EFI variable gets deleted from under us and return EOF
   when reading from a zero-length file - Lingzhu Xiang

 * Fix an oops in efivar_update_sysfs_entries() caused by reusing (and
   therefore corrupting) a kzalloc() allocation - Seiji Aguchi

 * Initialise the DataSize argument to GetVariable() otherwise it will
   not be updated with the actual size of the variable on return.
   Discovered on a Acer Aspire V3 BIOS - Lee, Chun-Yi

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-22 10:58:57 -07:00
Lukas Czerner bad5483196 reiserfs: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in reiserfs_invalidatepage()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: reiserfs-devel@vger.kernel.org
2013-05-21 23:58:51 -04:00
Lukas Czerner 5c0bb97ce0 gfs2: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in gfs2_invalidatepage().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
2013-05-21 23:58:49 -04:00
Lukas Czerner 569d39fc3e ceph: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in ceph_invalidatepage().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
2013-05-21 23:58:48 -04:00
Lukas Czerner e5f8d30d68 ocfs2: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in ocfs2_invalidatepage().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Joel Becker <jlbec@evilplan.org>
2013-05-21 23:58:46 -04:00
Lukas Czerner 34097dfe88 xfs: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in xfs_vm_invalidatepage()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
2013-05-21 23:58:01 -04:00
Lukas Czerner d8c8900ac1 jbd: change journal_invalidatepage() to accept length
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in journal_invalidatepage() and all the users in ext3 file
system. Also update ext3 trace point to print out length argument.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21 23:26:36 -04:00
Lukas Czerner ca99fdd26b ext4: use ->invalidatepage() length argument
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in all ext4 invalidatepage routines.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21 23:25:01 -04:00
Lukas Czerner 259709b07d jbd2: change jbd2_journal_invalidatepage to accept length
invalidatepage now accepts range to invalidate and there are two file
system using jbd2 also implementing punch hole feature which can benefit
from this. We need to implement the same thing for jbd2 layer in order to
allow those file system take benefit of this functionality.

This commit adds length argument to the jbd2_journal_invalidatepage()
and updates all instances in ext4 and ocfs2.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-05-21 23:20:03 -04:00
Lukas Czerner d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Dave Chinner 90253cf142 xfs: remote attribute allocation may be contiguous
When CRCs are enabled, there may be multiple allocations made if the
headers cause a length overflow. This, however, does not mean that
the number of headers required increases, as the second and
subsequent extents may be contiguous with the previous extent. Hence
when we map the extents to write the attribute data, we may end up
with less extents than allocations made. Hence the assertion that we
consume the number of headers we calculated in the allocation loop
is incorrect and needs to be removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-21 14:22:51 -05:00
Dave Chinner f648167f3a xfs: avoid nesting transactions in xfs_qm_scall_setqlim()
Lockdep reports:

=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #3 Not tainted
---------------------------------------------
setquota/28368 is trying to acquire lock:
 (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50

but task is already holding lock:
 (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50

from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
allocated.

xfs_qm_scall_setqlim() is starting a transaction and then not
passing it into xfs_qm_dqet() and so it starts it's own transaction
when allocating the dquot.  Splat!

Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
inside the setqlim transaction. This requires getting the dquot
first (and allocating it if necessary) then dropping and relocking
the dquot before joining it to the setqlim transaction.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-21 13:57:05 -05:00
Jim Rees 1a9357f443 nfsd: avoid undefined signed overflow
In C, signed integer overflow results in undefined behavior, but unsigned
overflow wraps around. So do the subtraction first, then cast to signed.

Reported-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-21 11:02:03 -04:00
Dave Chinner e461fcb194 xfs: remote attribute lookups require the value length
When reading a remote attribute, to correctly calculate the length
of the data buffer for CRC enable filesystems, we need to know the
length of the attribute data. We get this information when we look
up the attribute, but we don't store it in the args structure along
with the other remote attr information we get from the lookup. Add
this information to the args structure so we can use it
appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 17:16:12 -05:00
Dave Chinner b38958d715 xfs: xfs_attr_shortform_allfit() does not handle attr3 format.
xfstests generic/117 fails with:

XFS: Assertion failed: leaf->hdr.info.magic == cpu_to_be16(XFS_ATTR_LEAF_MAGIC)

indicating a function that does not handle the attr3 format
correctly. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 16:53:22 -05:00
Dave Chinner 72916fb8cb xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 16:32:30 -05:00
Dave Chinner ac14876cf9 xfs: fix missing KM_NOFS tags to keep lockdep happy
There are several places where we use KM_SLEEP allocation contexts
and use the fact that they are called from transaction context to
add KM_NOFS where appropriate. Unfortunately, there are several
places where the code makes this assumption but can be called from
outside transaction context but with filesystem locks held. These
places need explicit KM_NOFS annotations to avoid lockdep
complaining about reclaim contexts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 16:18:05 -05:00
Dave Chinner 52c24ad39f xfs: Don't reference the EFI after it is freed
Checking the EFI for whether it is being released from recovery
after we've already released the known active reference is a mistake
worthy of a brown paper bag. Fix the (now) obvious use after free
that it can cause.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 14:29:34 -05:00
Dave Chinner 28ca489c63 xfs: fix rounding in xfs_free_file_space
The offset passed into xfs_free_file_space() needs to be rounded
down to a certain size, but the rounding mask is built by a 32 bit
variable. Hence the mask will always mask off the upper 32 bits of
the offset and lead to incorrect writeback and invalidation ranges.

This is not actually exposed as a bug because we writeback and
invalidate from the rounded offset to the end of the file, and hence
the offset we are actually punching a hole out of will always be
covered by the code. This needs fixing, however, if we ever want to
use exact ranges for writeback/invalidation here...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 14:25:50 -05:00
Dave Chinner 49b137cbbc xfs: fix sub-page blocksize data integrity writes
FSX on 512 byte block size filesystems has been failing for some
time with corrupted data. The fault dates back to the change in
the writeback data integrity algorithm that uses a mark-and-sweep
approach to avoid data writeback livelocks.

Unfortunately, a side effect of this mark-and-sweep approach is that
each page will only be written once for a data integrity sync, and
there is a condition in writeback in XFS where a page may require
two writeback attempts to be fully written. As a result of the high
level change, we now only get a partial page writeback during the
integrity sync because the first pass through writeback clears the
mark left on the page index to tell writeback that the page needs
writeback....

The cause is writing a partial page in the clustering code. This can
happen when a mapping boundary falls in the middle of a page - we
end up writing back the first part of the page that the mapping
covers, but then never revisit the page to have the remainder mapped
and written.

The fix is simple - if the mapping boundary falls inside a page,
then simple abort clustering without touching the page. This means
that the next ->writepage entry that write_cache_pages() will make
is the page we aborted on, and xfs_vm_writepage() will map all
sections of the page correctly. This behaviour is also optimal for
non-data integrity writes, as it results in contiguous sequential
writeback of the file rather than missing small holes and having to
write them a "random" writes in a future pass.

With this fix, all the fsx tests in xfstests now pass on a 512 byte
block size filesystem on a 4k page machine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 14:14:25 -05:00
Andy Adamson 774d5f14ee NFSv4.1 Fix a pNFS session draining deadlock
On a CB_RECALL the callback service thread flushes the inode using
filemap_flush prior to scheduling the state manager thread to return the
delegation. When pNFS is used and I/O has not yet gone to the data server
servicing the inode, a LAYOUTGET can preceed the I/O. Unlike the async
filemap_flush call, the LAYOUTGET must proceed to completion.

If the state manager starts to recover data while the inode flush is sending
the LAYOUTGET, a deadlock occurs as the callback service thread holds the
single callback session slot until the flushing is done which blocks the state
manager thread, and the state manager thread has set the session draining bit
which puts the inode flush LAYOUTGET RPC to sleep on the forechannel slot
table waitq.

Separate the draining of the back channel from the draining of the fore channel
by moving the NFS4_SESSION_DRAINING bit from session scope into the fore
and back slot tables.  Drain the back channel first allowing the LAYOUTGET
call to proceed (and fail) so the callback service thread frees the callback
slot. Then proceed with draining the forechannel.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-20 14:20:14 -04:00
Jan Kara 211d022c43 xfs: Avoid pathological backwards allocation
Writing a large file using direct IO in 16 MB chunks sometimes results
in a pathological allocation pattern where 16 MB chunks of large free
extent are allocated to a file in a reversed order. So extents of a file
look for example as:

 ext logical physical expected length flags
   0        0        13          4550656
   1  4550656 188136807   4550668 12562432
   2 17113088 200699240 200699238 622592
   3 17735680 182046055 201321831   4096
   4 17739776 182041959 182050150   4096
   5 17743872 182037863 182046054   4096
   6 17747968 182033767 182041958   4096
   7 17752064 182029671 182037862   4096
...
6757 45400064 154381644 154389835   4096
6758 45404160 154377548 154385739   4096
6759 45408256 252951571 154381643  73728 eof

This happens because XFS_ALLOCTYPE_THIS_BNO allocation fails (the last
extent in the file cannot be further extended) so we fall back to
XFS_ALLOCTYPE_NEAR_BNO allocation which picks end of a large free
extent as the best place to continue the file. Since the chunk at the
end of the free extent again cannot be further extended, this behavior
repeats until the whole free extent is consumed in a reversed order.

For data allocations this backward allocation isn't beneficial so make
xfs_alloc_compute_diff() pick start of a free extent instead of its end
for them. That avoids the backward allocation pattern.

See thread at http://oss.sgi.com/archives/xfs/2013-03/msg00144.html for
more details about the reproduction case and why this solution was
chosen.

Based on idea by Dave Chinner <dchinner@redhat.com>.

CC: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-05-20 13:09:11 -05:00
Brian Foster bee6c30780 fuse: update inode size and invalidate attributes on fallocate
An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the
size of a file. Update the inode size after a successful fallocate.

Also invalidate the inode attributes after a successful fallocate
to ensure we pick up the latest attribute values (i.e., i_blocks).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-05-20 16:58:58 +02:00
Brian Foster 3634a63278 fuse: truncate pagecache range on hole punch
fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE
interface. When a hole punch is passed through, the page cache
is not cleared and thus allows reading stale data from the cache.

This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a
smallish random data file into cache, punching a hole and creating
a copy of the file. Drop caches or remount and observe that the
original file no longer matches the file copied after the hole
punch. The original file contains a zeroed range and the latter
file contains stale data.

Protect against writepage requests in progress and punch out the
associated page cache range after a successful client fs hole
punch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-05-20 16:58:57 +02:00
Linus Torvalds 130901ba33 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Miao Xie has been very busy, fixing races and enospc problems and many
  other small but important pieces.

  Alexandre Oliva discovered some problems with how our error handling
  was interacting with the block layer and for now has disabled our
  partial handling of sub-page writes.  The real sub-page work is in a
  series of patches from IBM that we still need to integrate and test.
  The code Alexandre has turned off was really incomplete.

  Josef has more error handling fixes and an important fix for the new
  skinny extent format.

  This also has my fix for the tracepoint crash from late in 3.9.  It's
  the first stage in a larger clean up to get rid of btrfs_bio and make
  a proper bioset for all the items we need to tack into the bio.  For
  now the bioset only holds our mirror_num and stripe_index, but for the
  next merge window I'll shuffle more in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: use a btrfs bioset instead of abusing bio internals
  Btrfs: make sure roots are assigned before freeing their nodes
  Btrfs: explicitly use global_block_rsv for quota_tree
  btrfs: do away with non-whole_page extent I/O
  Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
  Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
  Btrfs: pause the space balance when remounting to R/O
  Btrfs: fix unprotected root node of the subvolume's inode rb-tree
  Btrfs: fix accessing a freed tree root
  Btrfs: return errno if possible when we fail to allocate memory
  Btrfs: update the global reserve if it is empty
  Btrfs: don't steal the reserved space from the global reserve if their space type is different
  Btrfs: optimize the error handle of use_block_rsv()
  Btrfs: don't use global block reservation for inode cache truncation
  Btrfs: don't abort the current transaction if there is no enough space for inode cache
  Correct allowed raid levels on balance.
  Btrfs: fix possible memory leak in replace_path()
  Btrfs: fix possible memory leak in the find_parent_nodes()
  Btrfs: don't allow device replace on RAID5/RAID6
  Btrfs: handle running extent ops with skinny metadata
  ...
2013-05-18 11:35:28 -07:00
Chris Mason c5cb6a0573 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next 2013-05-17 21:53:17 -04:00
Chris Mason 9be3395bcd Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs.  They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index.  The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
2013-05-17 21:52:52 -04:00
Josef Bacik 655b09fe54 Btrfs: make sure roots are assigned before freeing their nodes
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point.  Just add checks to make sure these
roots are set to keep us from panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:38 -04:00
Stefan Behrens 3a6cad9009 Btrfs: explicitly use global_block_rsv for quota_tree
The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:36 -04:00
Alexandre Oliva 17a5adccf3 btrfs: do away with non-whole_page extent I/O
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error.  This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case.  Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes.  The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:35 -04:00
Miao Xie b216cbfb52 Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:34 -04:00
Miao Xie 314297c2a3 Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:32 -04:00
Miao Xie 061594ef17 Btrfs: pause the space balance when remounting to R/O
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:31 -04:00
Miao Xie e1409cef85 Btrfs: fix unprotected root node of the subvolume's inode rb-tree
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:30 -04:00
Miao Xie 89042e5ad2 Btrfs: fix accessing a freed tree root
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:29 -04:00
Liu Bo b9aa55bed1 Btrfs: return errno if possible when we fail to allocate memory
We need to set return value explicitly, otherwise we'll lose the error
value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:27 -04:00
Miao Xie d88033dbf4 Btrfs: update the global reserve if it is empty
Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:26 -04:00
Miao Xie 5881cfc924 Btrfs: don't steal the reserved space from the global reserve if their space type is different
If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:25 -04:00
Miao Xie b586b32374 Btrfs: optimize the error handle of use_block_rsv()
cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:24 -04:00
Miao Xie 7b61cd9224 Btrfs: don't use global block reservation for inode cache truncation
It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:22 -04:00
Miao Xie 7cfa9e51d2 Btrfs: don't abort the current transaction if there is no enough space for inode cache
The filesystem with inode cache was forced to be read-only when we umounted it.

Steps to reproduce:
 # mkfs.btrfs -f ${DEV}
 # mount -o inode_cache ${DEV} ${MNT}
 # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
 # btrfs fi syn ${MNT}
 # dd if=${MNT}/file1 of=/dev/null bs=1M
 # rm -f ${MNT}/file1
 # btrfs fi syn ${MNT}
 # umount ${MNT}

It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.

But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:21 -04:00
Andreas Philipp 8250dabedb Correct allowed raid levels on balance.
Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:20 -04:00
Stefan Behrens 379cde741b Btrfs: fix possible memory leak in replace_path()
In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.

Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:19 -04:00
Wang Shilong c16c2e2e51 Btrfs: fix possible memory leak in the find_parent_nodes()
In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:17 -04:00
Stefan Behrens 4968810752 Btrfs: don't allow device replace on RAID5/RAID6
This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.

One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).

0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918                            num_stripes = map->num_stripes;
4919                            max_errors = nr_parity_stripes(map);
4920
4921                            raid_map = kmalloc(sizeof(u64) * num_stripes,
4922                                               GFP_NOFS);
4923                            if (!raid_map) {
4924                                    ret = -ENOMEM;
4925                                    goto out;
4926                            }
4927

There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:16 -04:00
Josef Bacik b1c79e0947 Btrfs: handle running extent ops with skinny metadata
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata.  We also lose
the level of the metadata we are working on.  So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:15 -04:00
Josef Bacik 73e1e61fb8 Btrfs: remove warn on in free space cache writeout
This catches block groups that are too large to properly cache.  We deal with
this case fine, so the warning just confuses users.  Remove the warning.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:13 -04:00
Josef Bacik 69a85bd87c Btrfs: don't null pointer deref on abort
I'm sorry, theres no excuse for this sort of work.  We need to use
root->leafsize since eb may be NULL.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:12 -04:00
Gabriel de Perthuis 03b71c6ca6 btrfs: don't stop searching after encountering the wrong item
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:10 -04:00
Liu Bo a52f4cd2b1 Btrfs: fix off-by-one in fiemap
lock_extent/unlock_extent expect an exclusive end.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 16:27:26 -04:00
David Sterba 60b62978bc btrfs: annotate quota tree for lockdep
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.

There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID.  No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 16:27:25 -04:00
Jim Schutt 39be95e9c8 ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
while holding a lock, but it's spoiled because ceph_pagelist_addpage()
always calls kmap(), which might sleep.  Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
    . . .
[13439.376225] Call Trace:
[13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
[13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
[13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
[13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
[13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
    . . .
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the number
of encoded locks is stable, copy that into a ceph_pagelist.

[elder@inktank.com: abbreviated the stack info a bit.]

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:48 -05:00
Jim Schutt c420276a53 ceph: add cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that
num_fcntl_locks and num_flock_locks were properly decoded on the
server side, from a le32 over-the-wire type to a cpu type.
I checked, and AFAICS it is done; those interested can consult
    Locker::_do_cap_update()
in src/mds/Locker.cc and src/include/encoding.h in the Ceph server
code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe
that also happens correctly, by virtue of having been declared
__le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17 12:45:43 -05:00
Rami Rosen 5240d58c04 sysfs: kill sysfs_sb declaration in fs/sysfs/inode.c.
This patch removes sysfs_sb declaration from fs/sysfs/inode.c
(due to 0f4288ec6f,
 "Kill unused sysfs_sb variable").

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-17 10:26:08 -07:00
Warner Wang 434749108c sysfs: sysfs_link_sibling(): fix typo in comment
Fix a typo subling->sibling in the comment of sysfs_link_sibling().

Signed-off-by: Warner Wang <warner.wang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-17 10:26:08 -07:00
J. Bruce Fields ba4e55bb67 nfsd4: fix compile in !CONFIG_NFSD_V4_SECURITY_LABEL case
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-15 10:35:18 -04:00
David Quigley 18032ca062 NFSD: Server implementation of MAC Labeling
Implement labeled NFS on the server: encoding and decoding, and writing
and reading, of file labels.

Enabled with CONFIG_NFSD_V4_SECURITY_LABEL.

Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-15 09:27:02 -04:00
Linus Torvalds b973425cbb Fixed regressions (two stability regressions and a performance
regression) introduced during the 3.10-rc1 merge window.  Also
 included is a bug fix relating to allocating blocks after resizing an
 ext3 file system when using the ext4 file system driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRkZBlAAoJENNvdpvBGATwLYQP/iWOBs2z93WG23cqkgqvL8o6
 ZyeJdgy9dkFCArVDX5SSnGkJXZ3iqIKi5HoTKTJKfytgMzgiDAZcLsIHVv6NczwR
 UGhjgS3HEdV5tJ46E6JnpB3NLSb+rAdc5kCdlsbzU46CP+JjFiYEhxVpK7ELuM/G
 yctChbIH9FY+1OwxHccacBOaJU2ELhnH6B/8Ry/6gM2H0vfKeTNOdocOHdxvbNqg
 ooGjytMfVopMQEfVG8aXtTfy341NFJH5fAYEahCcXxeO9ta6Unj9yOu5JV2wVrTt
 39+DBsquGX6AVQsc9IxJ6YAN6ldwWN7l3huE9/AI0o/alwGsfVi5M+M/d1MMjDqf
 Fgl2EzzBpZQeKKY9UXNi4LLgYdBiILMgKDOGoRKhRb8ynSSf/JX43+24FvidEi3o
 o//J4aR+oSZfaovGAeikqyF1cumayhoNN8MINRN8igIinBiC4GjBFEl/Kl/1eAY/
 lREGcsmYPXOkVPpM72waRYlP4GwNdOg4QSEY0SGljpwluO+dYtKQjHXcv/s/xL5v
 j3GemzYVyjx4zaq1g3PxGfuD6VKFHr0T6jvzd6cHu17lnPlw9fwznHbEm9BEcXDY
 gbGx9u+a2ZTqDwYVALbeoRpf9Zz6DUCse3ts4N3rbkXUQQiBYo7tybfVopIMAukb
 CexvidDE/ryJrJJFBwoK
 =6cRD
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Fixed regressions (two stability regressions and a performance
  regression) introduced during the 3.10-rc1 merge window.

  Also included is a bug fix relating to allocating blocks after
  resizing an ext3 file system when using the ext4 file system driver"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
  ext4: revert "ext4: use io_end for multiple bios"
  ext4: limit group search loop for non-extent files
  ext4: fix fio regression
2013-05-14 09:30:54 -07:00
Brian Foster de82b92301 fuse: allocate for_background dio requests based on io->async state
Commit 8b41e671 introduced explicit background checking for fuse_req
structures with BUG_ON() checks for the appropriate type of request in
in the associated send functions. Commit bcba24cc introduced the ability
to send dio requests as background requests but does not update the
request allocation based on the type of I/O request. As a result, a
BUG_ON() triggers in the fuse_request_send_background() background path if
an async I/O is sent.

Allocate a request based on the async state of the fuse_io_priv to avoid
the BUG.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-05-14 18:25:44 +02:00
Lingzhu Xiang 3fab70c165 efivarfs: Never return ENOENT from firmware again
Previously in 1fa7e69 efi_status_to_err() translated firmware status
EFI_NOT_FOUND to -EIO instead of -ENOENT for efivarfs operations to
avoid confusion. After refactoring in e14ab23, it is also used in other
places where the translation may be unnecessary.

So move the translation to efivarfs specific code. Also return EOF
for reading zero-length files, which is what users would expect.

Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Lee, Chun-Yi <jlee@suse.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Lingzhu Xiang <lxiang@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-05-13 20:12:10 +01:00
Steve Dickson 4bdc33ed5b NFSDv4.2: Add NFS v4.2 support to the NFS server
This enables NFSv4.2 support for the server. To enable this
code do the following:
  echo "+4.2" >/proc/fs/nfsd/versions

after the nfsd kernel module is loaded.

On its own this does nothing except allow the server to respond to
compounds with minorversion set to 2.  All the new NFSv4.2 features are
optional, so this is perfectly legal.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-13 10:11:47 -04:00
J. Bruce Fields 4f540e29dc nfsd4: store correct client minorversion for >=4.2
This code assumes that any client using exchange_id is using NFSv4.1,
but with the introduction of 4.2 that will no longer true.

This main effect of this is that client callbacks will use the same
minorversion as that used on the exchange_id.

Note that clients are forbidden from mixing 4.1 and 4.2 compounds.  (See
rfc 5661, section 2.7, #13: "A client MUST NOT attempt to use a stateid,
filehandle, or similar returned object from the COMPOUND procedure with
minor version X for another COMPOUND procedure with minor version Y,
where X != Y.")  However, we do not currently attempt to enforce this
except in the case of mixing zero minor version with non-zero minor
versions.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-13 10:11:45 -04:00
Zhao Hongjiang eb48241bb4 nfsd: get rid of the unused functions in vfs
The fh_lock_parent(), nfsd_truncate(), nfsd_notify_change() and
nfsd_sync_dir() fuctions are neither implemented nor used, just remove
them.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-13 10:11:44 -04:00
Steve Dickson 4488cc96c5 NFS: Add NFSv4.2 protocol constants
Signed-off-by: Matthew N. Dodd <Matthew.Dodd@sparta.com>
Signed-off-by: Miguel Rodel Felipe <Rodel_FM@dsi.a-star.edu.sg>
Signed-off-by: Phua Eu Gene <PHUA_Eu_Gene@dsi.a-star.edu.sg>
Signed-off-by: Khin Mi Mi Aung <Mi_Mi_AUNG@dsi.a-star.edu.sg>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-13 10:05:43 -04:00
Colin Cross 9745cdb36d select: use freezable blocking call
Avoid waking up every thread sleeping in a select call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Colin Cross 1c441e9212 epoll: use freezable blocking call
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Colin Cross 5853cc2a89 freezer: add unsafe versions of freezable helpers for CIFS
CIFS calls wait_event_freezekillable_unsafe with a VFS lock held,
which is unsafe and will cause lockdep warnings when 6aa9707
"lockdep: check that no locks held at freeze time" is reapplied
(it was reverted in dbf520a).  CIFS shouldn't be doing this, but
it has long-running syscalls that must hold a lock but also
shouldn't block suspend.  Until CIFS freeze handling is rewritten
to use a signal to exit out of the critical section, add a new
wait_event_freezekillable_unsafe helper that will not run the
lockdep test when 6aa9707 is reapplied, and call it from CIFS.

In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock.  Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.

Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:21 +02:00
Colin Cross 416ad3c9c0 freezer: add unsafe versions of freezable helpers for NFS
NFS calls the freezable helpers with locks held, which is unsafe
and will cause lockdep warnings when 6aa9707 "lockdep: check
that no locks held at freeze time" is reapplied (it was reverted
in dbf520a).  NFS shouldn't be doing this, but it has
long-running syscalls that must hold a lock but also shouldn't
block suspend.  Until NFS freeze handling is rewritten to use a
signal to exit out of the critical section, add new *_unsafe
versions of the helpers that will not run the lockdep test when
6aa9707 is reapplied, and call them from NFS.

In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock.  Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.

Signed-off-by: Colin Cross <ccross@android.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:21 +02:00
Theodore Ts'o a549984b8c ext4: revert "ext4: use io_end for multiple bios"
This reverts commit 4eec708d26.

Multiple users have reported crashes which is apparently caused by
this commit.  Thanks to Dmitry Monakhov for bisecting it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
2013-05-11 19:07:42 -04:00
Linus Torvalds c4cc75c332 Merge git://git.infradead.org/users/eparis/audit
Pull audit changes from Eric Paris:
 "Al used to send pull requests every couple of years but he told me to
  just start pushing them to you directly.

  Our touching outside of core audit code is pretty straight forward.  A
  couple of interface changes which hit net/.  A simple argument bug
  calling audit functions in namei.c and the removal of some assembly
  branch prediction code on ppc"

* git://git.infradead.org/users/eparis/audit: (31 commits)
  audit: fix message spacing printing auid
  Revert "audit: move kaudit thread start from auditd registration to kaudit init"
  audit: vfs: fix audit_inode call in O_CREAT case of do_last
  audit: Make testing for a valid loginuid explicit.
  audit: fix event coverage of AUDIT_ANOM_LINK
  audit: use spin_lock in audit_receive_msg to process tty logging
  audit: do not needlessly take a lock in tty_audit_exit
  audit: do not needlessly take a spinlock in copy_signal
  audit: add an option to control logging of passwords with pam_tty_audit
  audit: use spin_lock_irqsave/restore in audit tty code
  helper for some session id stuff
  audit: use a consistent audit helper to log lsm information
  audit: push loginuid and sessionid processing down
  audit: stop pushing loginid, uid, sessionid as arguments
  audit: remove the old depricated kernel interface
  audit: make validity checking generic
  audit: allow checking the type of audit message in the user filter
  audit: fix build break when AUDIT_DEBUG == 2
  audit: remove duplicate export of audit_enabled
  Audit: do not print error when LSMs disabled
  ...
2013-05-11 14:29:11 -07:00
Linus Torvalds 2dbd3cac87 Merge branch 'for-3.10' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Small fixes for two bugs and two warnings"

* 'for-3.10' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error
  SUNRPC: fix decoding of optional gss-proxy xdr fields
  SUNRPC: Refactor gssx_dec_option_array() to kill uninitialized warning
  nfsd4: don't allow owner override on 4.1 CLAIM_FH opens
2013-05-10 09:28:55 -07:00
Linus Torvalds 3644bc2ec7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull stray syscall bits from Al Viro:
 "Several syscall-related commits that were missing from the original"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
  unicore32: just use mmap_pgoff()...
  unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
  x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
2013-05-10 09:21:05 -07:00
Linus Torvalds 6fad8d02ef Improve performance when AES-NI (and most likely other crypto accelerators) is
available by moving to the ablkcipher crypto API. The improvement is more
 apparent on faster storage devices. There's no noticeable change when hardware
 crypto is not available.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCgAGBQJRjJFjAAoJENaSAD2qAscKuRgQAJkefyLPNBb4plXr3R+zJ1aM
 jkcqkUJWamCxlmqHg/n9LmlxpSZiiWSWJM8iiq8zQhPE0PVXVOVhqvzogT1xwv75
 xfT9xTuRi1v7UFaSGHtj3WoO2nscQ0pjZW/0CHgd/PZjz9y0iJ/l6ueoWCOz5L2i
 3oJvx/W407qM+MSogWKx79i1B2jILdBdQH/7PZ+UJS3jWEo3rMWBbfCwbYhd4pUG
 oVc+qFglNs+3HLdHVUmHPCerCL9qYAJIJmDrvupSOQ6DwdFaV8IysTgSEdFtLcfC
 8Z6DUOPzXnvA/+y+NCCCUxg1CrkgYkNrefLKAq18atFu63zIZIHZyJBTJ5Q6vXVF
 o1H8UcIOg/liGa6lXGf5b4ENNKvB0qMYQgiSrL0/FVVima4zGqUWkQLno4kQl1zx
 FHB5imQ7F/EMcow/nTN3YQYC3N/iYIFAIRxf35SiGGsNhO2sEyIqYlRSyq2MNrRl
 pLWNbhnRuhBUqcbqZDxq1oZ7624Ui4jnHHx7rl6Y3gfm8Xa1ZmQeY6rOadSZaRd7
 +ZqFZi1jiHz1c0tVUO/3DsqABIhbr9Ee03vracN8bVTGV0ZmO0qneugFB9esoeDf
 UnrU0Im8ilHu3OHAyc+UkRZuzThL9bLZEivwICQ+JDUJ1zsLYSQZEaGIt8hA0iDy
 8Bu3gtfX2WxD4ak/AkFv
 =3EhP
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.10-rc1-ablkcipher' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs update from Tyler Hicks:
 "Improve performance when AES-NI (and most likely other crypto
  accelerators) is available by moving to the ablkcipher crypto API.
  The improvement is more apparent on faster storage devices.

  There's no noticeable change when hardware crypto is not available"

* tag 'ecryptfs-3.10-rc1-ablkcipher' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: Use the ablkcipher crypto API
2013-05-10 09:20:01 -07:00
Linus Torvalds 977b58e1dd Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu updates from Greg Ungerer:
 "The bulk of the changes are generalizing the ColdFire v3 core support
  and adding in 537x CPU support.  Also a couple of other bug fixes, one
  to fix a reintroduction of a past bug in the romfs filesystem nommu
  support."

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68knommu: enable Timer on coldfire 532x
  m68knommu: fix ColdFire 5373/5329 QSPI base address
  m68knommu: add support for configuring a Freescale M5373EVB board
  m68knommu: add support for the ColdFire 537x family of CPUs
  m68knommu: make ColdFire M532x platform support more v3 generic
  m68knommu: create and use a common M53xx ColdFire class of CPUs
  m68k: remove unused asm/dbg.h
  m68k: Set ColdFire ACR1 cache mode depending on kernel configuration
  romfs: fix nommu map length to keep inside filesystem
  m68k: clean up unused "config ROMVECSIZE"
2013-05-10 07:22:35 -07:00
Richard Genoud beadadfa54 UBIFS: correct mount message
When mounting an UBIFS R/W volume, we have the message:
UBIFS: mounted UBI device 0, volume 1, name "rootfs"(null)
With this patch, we'll have:
UBIFS: mounted UBI device 0, volume 1, name "rootfs"
Which is, I think, what was intended.

Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
Cc: stable@vger.kernel.org [v3.7+]
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2013-05-10 17:07:01 +03:00
Tyler Hicks 4dfea4f0d7 eCryptfs: Use the ablkcipher crypto API
Make the switch from the blkcipher kernel crypto interface to the
ablkcipher interface.

encrypt_scatterlist() and decrypt_scatterlist() now use the ablkcipher
interface but, from the eCryptfs standpoint, still treat the crypto
operation as a synchronous operation. They submit the async request and
then wait until the operation is finished before they return. Most of
the changes are contained inside those two functions.

Despite waiting for the completion of the crypto operation, the
ablkcipher interface provides performance increases in most cases when
used on AES-NI capable hardware.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Reviewed-by: Zeev Zilberman <zeev@annapurnaLabs.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Cc: Thieu Le <thieule@google.com>
Cc: Li Wang <dragonylffly@163.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@iki.fi>
2013-05-09 16:55:07 -07:00
Linus Torvalds 70eba4226d Couple of pstore cleanups
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRiosmAAoJEKurIx+X31iBQ/IP/2D+JHfGcMzZbRo+2moRyhp9
 VYUrNOg+QA+Csm6ZO43GMB4JZtodI/tO12YLNGFbNh+YELdpox7XTqmG058+pABV
 El1xt7i2Ro2/PxnWBRnnvWVHoWSsPI9R65aVNgz8C2QyDxG4wY9/ZYfcZQgEajJO
 M4gyKx/d54WjKcD31OJBF5NGki2zZQcuBI8vWkjXZBPximNj+cJeC27VPnuNI2GC
 p32p9Q+pvBv43bkf3EPEFGsd/ZKdczZ75SOzLXVqOEJhmsxfEKgleQbBd3hiRlwe
 zwj8lCzjZS3xR9oCnvalzWgswFQMd9S1kQYbItdztIofU4Y9hP6wmsEzLofXku+0
 FshRxAOCtW1jSgRGo/BiDNfRQm8w+l+rJGafZafK6cQOtUHLD+Kig8AhFBIY3Nt1
 Wzsmjz5ERhMDImM2lVw4ypji5vWkQ50wst6sX5CQHiOdOEmX+HYJkpOXCHzZYNY2
 IrAa/qq6EKbQCfUf7btbsFMDMDIYk66gy6R596sC1Onwy14ecxWerY1HO0BFg3b8
 UUo9tGVHEBBLdogc1aErU6OBL2DLeO7TeURhFOu7UHo0PfJNtLyG2ekYOYjwgS+Q
 RKNJz9PHKsQnBOdI+dK3iHG2uqkbv6rg0pH5oIglCptabqflN9SE0RpM5b2PSLT3
 74fTPSlUDSuCcm1PXJ8O
 =Zw0E
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull trivial pstore update from Tony Luck:
 "Couple of pstore cleanups"

It turns out that the kmemdup() conversion ends up being undone by the
fact that the memory block also needed the ecc information (see commit
bd08ec33b5c2: "pstore/ram: Restore ecc information block"), so all that
remains after merging is the error return code change.

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore/ram: fix error return code in ramoops_probe()
  fs: pstore: Replaced calls to kmalloc and memcpy with kmemdup
2013-05-09 16:42:10 -07:00
Linus Torvalds 07e074503e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs fixes from Al Viro:
 "Regression fix from Geert + yet another open-coded kernel_read()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs: don't open-code kernel_read()
  xtensa simdisk: Fix proc_create_data() conversion fallout
2013-05-09 13:44:35 -07:00
Linus Torvalds 983a5f84a4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "These are mostly fixes.  The biggest exceptions are Josef's skinny
  extents and Jan Schmidt's code to rebuild our quota indexes if they
  get out of sync (or you enable quotas on an existing filesystem).

  The skinny extents are off by default because they are a new variation
  on the extent allocation tree format.  btrfstune -x enables them, and
  the new format makes the extent allocation tree about 30% smaller.

  I rebased this a few days ago to rework Dave Sterba's crc checks on
  the super block, but almost all of these go back to rc6, since I
  though 3.9 was due any minute.

  The biggest missing fix is the tracepoint bug that was hit late in
  3.9.  I ran into problems with that in overnight testing and I'm still
  tracking it down.  I'll definitely have that fixed for rc2."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
  Btrfs: allow superblock mismatch from older mkfs
  btrfs: enhance superblock checks
  btrfs: fix misleading variable name for flags
  btrfs: use unsigned long type for extent state bits
  Btrfs: improve the loop of scrub_stripe
  btrfs: read entire device info under lock
  btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
  btrfs: handle errors returned from get_tree_block_key
  btrfs: make static code static & remove dead code
  Btrfs: deal with errors in write_dev_supers
  Btrfs: remove almost all of the BUG()'s from tree-log.c
  Btrfs: deal with free space cache errors while replaying log
  Btrfs: automatic rescan after "quota enable" command
  Btrfs: rescan for qgroups
  Btrfs: split btrfs_qgroup_account_ref into four functions
  Btrfs: allocate new chunks if the space is not enough for global rsv
  Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
  btrfs: move leak debug code to functions
  Btrfs: return free space in cow error path
  Btrfs: set UUID in root_item for created trees
  ...
2013-05-09 13:07:40 -07:00
Linus Torvalds 8769e078a9 xfs: update (#2) for v3.10-rc1
* add CONFIG_XFS_WARN, a step between zero debugging and CONFIG_XFS_DEBUG.
 * fix attrmulti and attrlist to fall back to vmalloc when kmalloc fails.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRi/V+AAoJENaLyazVq6ZOgncP/i0p141SO7qAwf20ql5NMi5Q
 KdhzWymyILIZ8GQoxsvoDASXb3D5I/f+8qX1WgrAG7/k7fahdGOYCtFTJ5YqYLvy
 ocxNoKqkhmUrMBbx0BcoNB4rtqgwLuHMR9ihDGDJrJDsn1b6nZlr38VBngIMhcC3
 rWS8hU6ZwEnGm9hcmNzMBLJpjCJRfwqRr9CmHbENP/LspV0ZTSgltIuFggBfjK6q
 LK3FYdrlYfiX1md3c9zPpdVn8P3hxeM/3Jq+O3mLZ6hY4uq1+NBSKQbS5uwg1d6e
 3ib5hxBDlRk//+S0mzJJ3DZ+Qqa2zHpZ/jNKsjdOBUoyqEdRC5tI9hx3F3la5VxP
 4oktNP2rvhpziRqRny/EwZm5xHrGTrdP8Uwq2nO2nxevtZyVd+P73K/17sNgOeCU
 8Xm6d3Usxw4FUTbiHjw0LFoJlM8yfEUSiTr1K8TmDP5G4phE6RsNnlbDLSUZ6kgh
 6UodqGZtKqdXegYljtwysX75PELQcWWcrA5+7U2/yk+qKyEJ3HvMNIXHvW6MqGTL
 69gxdrdD/Ff+83N+ktA3Ks31aj9n0LYpy6shxXg+YpHl6Mny3UiEpiUEsVc8jn4E
 iJ4qVqQI73qLDCNLM9XShBpPz5N/aPzaFw7QPXbj3A9UH3wd5OmUKNieoL4t9ucH
 sAreKLnndOuriYoTAWA9
 =IPgF
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull xfs update (#2) from Ben Myers:

 - add CONFIG_XFS_WARN, a step between zero debugging and
   CONFIG_XFS_DEBUG.

 - fix attrmulti and attrlist to fall back to vmalloc when kmalloc
   fails.

* tag 'for-linus-v3.10-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: fallback to vmalloc for large buffers in xfs_compat_attrlist_by_handle
  xfs: fallback to vmalloc for large buffers in xfs_attrlist_by_handle
  xfs: introduce CONFIG_XFS_WARN
2013-05-09 13:06:20 -07:00
Al Viro 91c2e0bcae unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 13:46:38 -04:00
Al Viro 39dfe6c6a5 ecryptfs: don't open-code kernel_read()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-09 13:39:58 -04:00
Linus Torvalds 8cbc95ee74 More NFS client bugfixes for 3.10
- Ensure that we match the 'sec=' mount flavour against the server list
 - Fix the NFSv4 byte range locking in the presence of delegations
 - Ensure that we conform to the NFSv4.1 spec w.r.t. freeing lock stateids
 - Fix a pNFS data server connection race
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRit1yAAoJEGcL54qWCgDyD9EQAKgb37dXhGt7OXBRBP4EY/T8
 xJZ2tmdDZ6etLFJVftqCv05hBvyfilPLK0E9zg/zW/kvkKxYQ/fykvpzBR/+Q7KF
 quOmjDHLhDTXBnXzPg1HEoeTaXI2/a8CdjpxxEkthD4+FaKlyCXM+EFtA9orT9ZI
 oM+aNaqEzTjoQyryTFMcHxAvsrqjnZBa0MT6Fh45HaLaijV7CdDWoj6gjy6Lc3Al
 4wHeT8QrZTp/NfIN16uykFZjeWwul4N9upu+CI2V8ZDMEit6JDYX4sl5tB41PzYW
 audDBcu0waSqoVQ2mJ5OHoYGZf0wopMUFaAst+tn0pQvwWUfTjD8XtO8uOgeMNoz
 2S+XxUC2qhSMszwNBVSmwe2LtSAyHiw32Md4hqkLYDH2c7tk8bJPKDXZJACBzJS7
 O1aMmOgWar8+nmzvmXFeU804SxBykV1V8UgtXWp5IwC36V0HAYnM5xtHwXBR7HWe
 lnuVHVdux7ySeAyrs2aMdKk7SAw5OC//WW8qoEF5USDEIljeoBzA+IYu9n91Hg2b
 ufnsyxumGJ6dZ0iU2nJVoLagRaZcm6kOhnxcegMpb9IH2+RLCQNef09lj2iklm2j
 mJA4o2lkVEHOswg/NwKn/I4ho8tbNNb8v//S5KiqrYhiiqZhOzu3RRtFeZi91iac
 P/g+hPzfuGnmwcoCEUSa
 =5zpc
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull more NFS client bugfixes from Trond Myklebust:

 - Ensure that we match the 'sec=' mount flavour against the server list

 - Fix the NFSv4 byte range locking in the presence of delegations

 - Ensure that we conform to the NFSv4.1 spec w.r.t.  freeing lock
   stateids

 - Fix a pNFS data server connection race

* tag 'nfs-for-3.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS4.1 Fix data server connection race
  NFSv3: match sec= flavor against server list
  NFSv4.1: Ensure that we free the lock stateid on the server
  NFSv4: Convert nfs41_free_stateid to use an asynchronous RPC call
  SUNRPC: Don't spam syslog with "Pseudoflavor not found" messages
  NFSv4.x: Fix handling of partially delegated locks
2013-05-09 10:24:54 -07:00
Jeff Layton 7255e716b1 nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error
Toralf reported the following oops to the linux-nfs mailing list:

    -----------------[snip]------------------
    NFSD: unable to generate recoverydir name (-2).
    NFSD: disabling legacy clientid tracking. Reboot recovery will not function correctly!
    BUG: unable to handle kernel NULL pointer dereference at 000003c8
    IP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
    *pdpt = 000000002ba33001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    Modules linked in: loop nfsd auth_rpcgss ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_tcpudp xt_recent xt_conntrack nf_conntrack_ftp xt_limit xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun arc4 iwldvm mac80211 coretemp kvm_intel uvcvideo sdhci_pci sdhci mmc_core videobuf2_vmalloc videobuf2_memops usblp videobuf2_core i915 iwlwifi psmouse videodev cfg80211 kvm fbcon bitblit cfbfillrect acpi_cpufreq mperf evdev softcursor font cfbimgblt i2c_algo_bit cfbcopyarea intel_agp intel_gtt drm_kms_helper snd_hda_codec_conexant drm agpgart fb fbdev tpm_tis thinkpad_acpi tpm nvram e1000e rfkill thermal ptp wmi pps_core tpm_bios 8250_pci processor 8250 ac snd_hda_intel snd_hda_codec snd_pcm battery video i2c_i801 snd_page_alloc snd_timer button serial_core i2c_core snd soundcore thermal_sys hwmon aesni_intel ablk_helper cryp
td lrw aes_i586 xts gf128mul cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
    Pid: 6374, comm: nfsd Not tainted 3.9.1 #6 LENOVO 4180F65/4180F65
    EIP: 0060:[<f90a3d91>] EFLAGS: 00010202 CPU: 0
    EIP is at nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
    EAX: 00000000 EBX: fffffffe ECX: 00000007 EDX: 00000007
    ESI: eb9dcb00 EDI: eb2991c0 EBP: eb2bde38 ESP: eb2bde34
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 80050033 CR2: 000003c8 CR3: 2ba80000 CR4: 000407f0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: ffff0ff0 DR7: 00000400
    Process nfsd (pid: 6374, ti=eb2bc000 task=eb2711c0 task.ti=eb2bc000)
    Stack:
    fffffffe eb2bde4c f90a3e0c f90a7754 fffffffe eb0a9c00 eb2bdea0 f90a41ed
    eb2991c0 1b270000 eb2991c0 eb2bde7c f9099ce9 eb2bde98 0129a020 eb29a020
    eb2bdecc eb2991c0 eb2bdea8 f9099da5 00000000 eb9dcb00 00000001 67822f08
    Call Trace:
    [<f90a3e0c>] legacy_recdir_name_error+0x3c/0x40 [nfsd]
    [<f90a41ed>] nfsd4_create_clid_dir+0x15d/0x1c0 [nfsd]
    [<f9099ce9>] ? nfsd4_lookup_stateid+0x99/0xd0 [nfsd]
    [<f9099da5>] ? nfs4_preprocess_seqid_op+0x85/0x100 [nfsd]
    [<f90a4287>] nfsd4_client_record_create+0x37/0x50 [nfsd]
    [<f909d6ce>] nfsd4_open_confirm+0xfe/0x130 [nfsd]
    [<f90980b1>] ? nfsd4_encode_operation+0x61/0x90 [nfsd]
    [<f909d5d0>] ? nfsd4_free_stateid+0xc0/0xc0 [nfsd]
    [<f908fd0b>] nfsd4_proc_compound+0x41b/0x530 [nfsd]
    [<f9081b7b>] nfsd_dispatch+0x8b/0x1a0 [nfsd]
    [<f857b85d>] svc_process+0x3dd/0x640 [sunrpc]
    [<f908165d>] nfsd+0xad/0x110 [nfsd]
    [<f90815b0>] ? nfsd_destroy+0x70/0x70 [nfsd]
    [<c1054824>] kthread+0x94/0xa0
    [<c1486937>] ret_from_kernel_thread+0x1b/0x28
    [<c1054790>] ? flush_kthread_work+0xd0/0xd0
    Code: 86 b0 00 00 00 90 c5 0a f9 c7 04 24 70 76 0a f9 e8 74 a9 3d c8 eb ba 8d 76 00 55 89 e5 53 66 66 66 66 90 8b 15 68 c7 0a f9 85 d2 <8b> 88 c8 03 00 00 74 2c 3b 11 77 28 8b 5c 91 08 85 db 74 22 8b
    EIP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd] SS:ESP 0068:eb2bde34
    CR2: 00000000000003c8
    ---[ end trace 09e54015d145c9c6 ]---

The problem appears to be a regression that was introduced in commit
9a9c6478 "nfsd: make NFSv4 recovery client tracking options per net".
Prior to that commit, it was safe to pass a NULL net pointer to
nfsd4_client_tracking_exit in the legacy recdir case, and
legacy_recdir_name_error did so. After that comit, the net pointer must
be valid.

This patch just fixes legacy_recdir_name_error to pass in a valid net
pointer to that function.

Cc: <stable@vger.kernel.org> # v3.8+
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Reported-and-tested-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-05-09 12:58:36 -04:00
Jeff Layton c2b93e0699 cifs: only set ops for inodes in I_NEW state
It's generally not safe to reset the inode ops once they've been set. In
the case where the inode was originally thought to be a directory and
then later found to be a DFS referral, this can lead to an oops when we
try to trigger an inode op on it after changing the ops to the blank
referral operations.

Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-08 17:12:47 -05:00
Linus Torvalds 942d33da99 f2fs updates for v3.10
This patch-set includes the following major enhancement patches.
 o introduce a new gloabl lock scheme
 o add tracepoints on several major functions
 o fix the overall cleaning process focused on victim selection
 o apply the block plugging to merge IOs as much as possible
 o enhance management of free nids and its list
 o enhance the readahead mode for node pages
 o address several cretical deadlock conditions
 o reduce lock_page calls
 
 The other minor bug fixes and enhancements are as follows.
 o calculation mistakes: overflow
 o bio types: READ, READA, and READ_SYNC
 o fix the recovery flow, data races, and null pointer errors
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRijCLAAoJEEAUqH6CSFDSg9kQAIqxmQzCUvCN3HcyVe8bGhKz
 8xhKrAY6ySRCKMuBbFRQsNrXUhckE3A44DgzYm5/gQikr/c8zhbqPVrtZ968eCKb
 wm3J+Re/uwZr5eOXlJEaHIiSkMDtERN7Cu2oYJWZi2B9wCSZcgvoWQ3c3LUVk6yF
 GFdi1Y00ll5tFKbEGbXSsfdul9P8jp0MmuMnWBBQZF3TrjETXMdThA5FXN0yTf9s
 XkcGE9vTCCPk8p7P3YmGGw6CwlaL8oallm0//iL4nMNpJzveq2C09IlY2BNrxU3L
 iTNXeIBdbhwXpnh2zq26Cy+cIEDIp0oXYui5BYdr/LWyWU3T/INa+hjUUszsESxF
 51LIUA1rA9nX/BSmj2QomswZ3lt4u5jl6rSBFKv3NG1KsFrAdb8S4tHukRSTSxAJ
 gzpY6kLT1+bgciA16F5W4yhzMYPN5hPa8s6hx4LHlpoqQICQsurjtS9KW7vncLFt
 ttmCMn8ehHcTzKRNNqYaBerCtSB3Z3G/uAy1y+DB7Zx2h2mqhCBXRalyRvs7RKvK
 d5OyYCpHntxuzDwVuivnr9Ddp30LUP1WqexxK+ykn99Ji3leMmffHP8Oari8w96b
 RxSbjoo8hOgoS5xZ4v3AaqtLDlBpxC6oWJzDaq/fJeKxOx22Z5BDFUM9mBGxrouJ
 AATl8b+cW/aTZ4l7WOPU
 =Hqii
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches.
   - introduce a new gloabl lock scheme
   - add tracepoints on several major functions
   - fix the overall cleaning process focused on victim selection
   - apply the block plugging to merge IOs as much as possible
   - enhance management of free nids and its list
   - enhance the readahead mode for node pages
   - address several cretical deadlock conditions
   - reduce lock_page calls

  The other minor bug fixes and enhancements are as follows.
   - calculation mistakes: overflow
   - bio types: READ, READA, and READ_SYNC
   - fix the recovery flow, data races, and null pointer errors"

* tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
  f2fs: cover free_nid management with spin_lock
  f2fs: optimize scan_nat_page()
  f2fs: code cleanup for scan_nat_page() and build_free_nids()
  f2fs: bugfix for alloc_nid_failed()
  f2fs: recover when journal contains deleted files
  f2fs: continue to mount after failing recovery
  f2fs: avoid deadlock during evict after f2fs_gc
  f2fs: modify the number of issued pages to merge IOs
  f2fs: remove useless #include <linux/proc_fs.h> as we're now using sysfs as debug entry.
  f2fs: fix inconsistent using of NM_WOUT_THRESHOLD
  f2fs: check truncation of mapping after lock_page
  f2fs: enhance alloc_nid and build_free_nids flows
  f2fs: add a tracepoint on f2fs_new_inode
  f2fs: check nid == 0 in add_free_nid
  f2fs: add REQ_META about metadata requests for submit
  f2fs: give a chance to merge IOs by IO scheduler
  f2fs: avoid frequent background GC
  f2fs: add tracepoints to debug checkpoint request
  f2fs: add tracepoints for write page operations
  f2fs: add tracepoints to debug the block allocation
  ...
2013-05-08 15:11:48 -07:00
Andy Adamson c23266d532 NFS4.1 Fix data server connection race
Unlike meta data server mounts which support multiple mount points to
the same server via struct nfs_server, data servers support a single connection.

Concurrent calls to setup the data server connection can race where the first
call allocates the nfs_client struct, and before the cache struct nfs_client
pointer can be set, a second call also tries to setup the connection, finds the
already allocated nfs_client, bumps the reference count, re-initializes the
session,etc. This results in a hanging data server session after umount.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-05-08 17:19:32 -04:00
Wei Yongjun 47110b8891 pstore/ram: fix error return code in ramoops_probe()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-05-08 10:24:48 -07:00
Linus Torvalds 4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Jaegeuk Kim 59bbd474ab f2fs: cover free_nid management with spin_lock
After build_free_nids() searches free nid candidates from nat pages and
current journal blocks, it checks all the candidates if they are allocated
so that the nat cache has its nid with an allocated block address.

In this procedure, previously we used
    list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list).
But, this is not covered by free_nid_list_lock, resulting in null pointer bug.

This patch moves this checking routine inside add_free_nid() in order not to use
the spin_lock.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:22 +09:00
Haicheng Li 23d3884427 f2fs: optimize scan_nat_page()
When nm_i->fcnt > 2 * MAX_FREE_NIDS, stop scanning other NAT entries.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: fix handling the return value of add_free_nid()]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:22 +09:00
Haicheng Li 8760952d92 f2fs: code cleanup for scan_nat_page() and build_free_nids()
This patch does two cleanups:
1. remove unused variable "fcnt" in build_free_nids().
2. make scan_nat_page() as void type and remove useless variable "fcnt".

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:21 +09:00
Haicheng Li 95630cbadc f2fs: bugfix for alloc_nid_failed()
Directly drop the free_nid cache when nm_i->fcnt > 2 * MAX_FREE_NIDS

Since there is NOT nmi->free_nid_list_lock spinlock protection between
a sequential calling of alloc_nid() and alloc_nid_failed(), some other
threads may already add new free_nid to the free_nid_list during this
period.

We need to make sure nmi->fcnt is never > 2 * MAX_FREE_NIDS.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
[Jaegeuk Kim: fit the coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:20 +09:00
Chris Fries 047184b42b f2fs: recover when journal contains deleted files
When recovering a journal file with fsync data for files that have
been deleted, don't bail out on recovery.

Signed-off-by: Chris Fries <C.Fries@motorola.com>
Reviewed-by: Russell Knize <rknize2@motorola.com>
Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com>
[Jaegeuk Kim: fit the coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:20 +09:00
Chris Fries bde582b225 f2fs: continue to mount after failing recovery
When unable to roll forward the journal, we shouldn't bail out and
not mount, we should continue to attempt the mount.  Bad recovery data
is likely unrecoverable at this point, and requiring the user to try
to mount again doesn't solve any issues.

Signed-off-by: Chris Fries <C.Fries@motorola.com>
Reviewed-by: Russell Knize <rknize2@motorola.com>
Reviewed-by: Jason Hrycay <jason.hrycay@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:19 +09:00
Jaegeuk Kim 531ad7d58c f2fs: avoid deadlock during evict after f2fs_gc
o Deadlock case #1

Thread 1:
- writeback_sb_inodes
 - do_writepages
  - f2fs_write_data_pages
   - write_cache_pages
    - f2fs_write_data_page
     - f2fs_balance_fs
      - wait mutex_lock(gc_mutex)

Thread 2:
- f2fs_balance_fs
 - mutex_lock(gc_mutex)
 - f2fs_gc
  - f2fs_iget
   - wait iget_locked(inode->i_lock)

Thread 3:
- do_unlinkat
 - iput
  - lock(inode->i_lock)
   - evict
    - inode_wait_for_writeback

o Deadlock case #2

Thread 1:
- __writeback_single_inode
 : set I_SYNC
  - do_writepages
   - f2fs_write_data_page
    - f2fs_balance_fs
     - f2fs_gc
      - iput
       - evict
        - inode_wait_for_writeback(I_SYNC)

In order to avoid this, even though iput is called with the zero-reference
count, we need to stop the eviction procedure if the inode is on writeback.
So this patch links f2fs_drop_inode which checks the I_SYNC flag.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-05-08 19:54:08 +09:00
Linus Torvalds 5af43c24ca Merge branch 'akpm' (incoming from Andrew)
Merge more incoming from Andrew Morton:

 - Various fixes which were stalled or which I picked up recently

 - A large rotorooting of the AIO code.  Allegedly to improve
   performance but I don't really have good performance numbers (I might
   have lost the email) and I can't raise Kent today.  I held this out
   of 3.9 and we could give it another cycle if it's all too late/scary.

I ended up taking only the first two thirds of the AIO rotorooting.  I
left the percpu parts and the batch completion for later.  - Linus

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
  aio: don't include aio.h in sched.h
  aio: kill ki_retry
  aio: kill ki_key
  aio: give shared kioctx fields their own cachelines
  aio: kill struct aio_ring_info
  aio: kill batch allocation
  aio: change reqs_active to include unreaped completions
  aio: use cancellation list lazily
  aio: use flush_dcache_page()
  aio: make aio_read_evt() more efficient, convert to hrtimers
  wait: add wait_event_hrtimeout()
  aio: refcounting cleanup
  aio: make aio_put_req() lockless
  aio: do fget() after aio_get_req()
  aio: dprintk() -> pr_debug()
  aio: move private stuff out of aio.h
  aio: add kiocb_cancel()
  aio: kill return value of aio_complete()
  char: add aio_{read,write} to /dev/{null,zero}
  aio: remove retry-based AIO
  ...
2013-05-07 20:49:51 -07:00
Kent Overstreet a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Kent Overstreet 41ef4eb8ee aio: kill ki_retry
Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
need this anymore - ki_retry was only called right after the kiocb was
initialized.

This also refactors and trims some duplicated code, as well as cleaning up
the refcounting/error handling a bit.

[akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
[akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 19:46:02 -07:00
Kent Overstreet 8a6608907c aio: kill ki_key
ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 19:41:50 -07:00
Jeff Layton 33e2208acf audit: vfs: fix audit_inode call in O_CREAT case of do_last
Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
In older kernels, creating a file with open(..., O_CREAT) created
audit_name records that looked like this:

type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

...in recent kernels though, they look like this:

type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

Richard bisected to determine that the problems started with commit
bfcec708, but the log messages have changed with some later
audit-related patches.

The problem is that this audit_inode call is passing in the parent of
the dentry being opened, but audit_inode is being called with the parent
flag false. This causes later audit_inode and audit_inode_child calls to
match the wrong entry in the audit_names list.

This patch simply sets the flag to properly indicate that this inode
represents the parent. With this, the audit_names entries are back to
looking like they did before.

Cc: <stable@vger.kernel.org> # v3.7+
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Test By: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:19 -04:00
Kent Overstreet 4e23bcaeb9 aio: give shared kioctx fields their own cachelines
[akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 18:38:29 -07:00