frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVGx6pAAoJEPL5WVaVDYGjh5cIAKQAyGST92IbTkxRZsxMgqnH
7LQI+fbNn6oHGEjSSnsWLxl6CpwT4WrCmj8WhVmpAoTLU958nBbF7iZAaaeQCGeS
3EqaNOlKvuOK9M5PKK7a5AWO04uJuj+t6s536OqHyB1zRb1yYMsywllPzu63eigA
jxu2yZxkFIKjo2ohSaTDRONVCsQGlqgZ2Aq/Ho5vy5QffVJKTN1G/3Kf33xukUyr
SAnndaax23jMqcFJE3gePYXc3W8EuGoloehKyo04qFeNNVMmSoytXAwMzcTmHn+H
biOTN5ezSKbYzv1aevRg7UuSPv17/yIo3aEberfLBgsn5O4wJGDdS+LajaI5/x8=
=0k0d
-----END PGP SIGNATURE-----
Merge tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull lazytime fixes from Ted Ts'o:
"This fixes a problem in the lazy time patches, which can cause
frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours"
* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs: add dirtytime_expire_seconds sysctl
fs: make sure the timestamps for lazytime inodes eventually get written
Pull nfsd fixes from Bruce Fields:
"Two main issues:
- We found that turning on pNFS by default (when it's configured at
build time) was too aggressive, so we want to switch the default
before the 4.0 release.
- Recent client changes to increase open parallelism uncovered a
serious bug lurking in the server's open code.
Also fix a krb5/selinux regression.
The rest is mainly smaller pNFS fixes"
* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
sunrpc: make debugfs file creation failure non-fatal
nfsd: require an explicit option to enable pNFS
NFSD: Fix bad update of layout in nfsd4_return_file_layout
NFSD: Take care the return value from nfsd4_encode_stateid
NFSD: Printk blocklayout length and offset as format 0x%llx
nfsd: return correct lockowner when there is a race on hash insert
nfsd: return correct openowner when there is a race to put one in the hash
NFSD: Put exports after nfsd4_layout_verify fail
NFSD: Error out when register_shrinker() fail
NFSD: Take care the return value from nfsd4_decode_stateid
NFSD: Check layout type when returning client layouts
NFSD: restore trace event lost in mismerge
Turns out sending out layouts to any client is a bad idea if they
can't get at the storage device, so require explicit admin action
to enable pNFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
locks_delete_lock_ctx() is called inside the loop, so we
should use list_for_each_entry_safe.
Fixes: 8634b51f6c (locks: convert lease handling to file_lock_context)
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
With return layout as, (seg is return layout, lo is record layout)
seg->offset <= lo->offset and layout_end(seg) < layout_end(lo),
nfsd should update lo's offset to seg's end,
and,
seg->offset > lo->offset and layout_end(seg) >= layout_end(lo),
nfsd should update lo's end to seg's offset.
Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When testing pnfs with nfsd_debug on, nfsd print a negative number
of layout length and foff in nfsd4_block_proc_layoutget as,
"GET: -xxxx:-xxx 2"
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
alloc_init_lock_stateowner can return an already freed entry if there is
a race to put openowners in the hashtable.
Noticed by inspection after Jeff Layton fixed the same bug for open
owners. Depending on client behavior, this one may be trickier to
trigger in practice.
Fixes: c58c6610ec "nfsd: Protect adding/removing lock owners using client_lock"
Cc: <stable@vger.kernel.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
alloc_init_open_stateowner can return an already freed entry if there is
a race to put openowners in the hashtable.
In commit 7ffb588086, we changed it so that we allocate and initialize
an openowner, and then check to see if a matching one got stuffed into
the hashtable in the meantime. If it did, then we free the one we just
allocated and take a reference on the one already there. There is a bug
here though. The code will then return the pointer to the one that was
allocated (and has now been freed).
This wasn't evident before as this race almost never occurred. The Linux
kernel client used to serialize requests for a single openowner. That
has changed now with v4.0 kernels, and this race can now easily occur.
Fixes: 7ffb588086
Cc: <stable@vger.kernel.org> # v3.17+
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert(). In this case a hfs_brec_update_parent() is
called to update the parent index node (if exists) and it is passed
hfs_find_data with a search_key containing a newly inserted key instead
of the key to be updated. This results in an inconsistent index node.
The bug reproduces on my machine after an extents overflow record for
the catalog file (CNID=4) is inserted into the extents overflow B-tree.
Because of a low (reserved) value of CNID=4, it has to become the first
record in the first leaf node.
The resulting first leaf node is correct:
----------------------------------------------------
| key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... |
----------------------------------------------------
But the parent index key0 still contains the previous key CNID=123:
-----------------------
| key0.CNID=123 | ... |
-----------------------
A change in hfs_brec_insert() makes hfs_brec_update_parent() work
correctly by preventing it from getting fd->record=-1 value from
__hfs_brec_find().
Along the way, I removed duplicate code with unification of the if
condition. The resulting code is equivalent to the original code
because node is never 0.
Also hfs_brec_update_parent() will now return an error after getting a
negative fd->record value. However, the return value of
hfs_brec_update_parent() is not checked anywhere in the file and I'm
leaving it unchanged by this patch. brec.c lacks error checking after
some other calls too, but this issue is of less importance than the one
being fixed by this patch.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When affs_bread_ino() fails, correctly unlock the page and release the
page cache with proper error value. All write_end() should
unlock/release the page that was locked by write_beg().
Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here are two bugfixes for things reported. One regression in kernfs,
and another issue fixed in the LZ4 code that was fixed in the "upstream"
codebase that solves a reported kernel crash.
Both have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlUOl6YACgkQMUfUDdst+ylRjQCfT46/we0U/S38/sTNYtElQXUI
ZEsAnAwozBj6nIYOydxaKnA0BcujsSGz
=unYd
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two bugfixes for things reported. One regression in kernfs,
and another issue fixed in the LZ4 code that was fixed in the
"upstream" codebase that solves a reported kernel crash
Both have been in linux-next for a while"
* tag 'driver-core-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
LZ4 : fix the data abort issue
kernfs: handle poll correctly on 'direct_read' files.
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
Pull nfsd bufix from Bruce Fields:
"This is a fix for a crash easily triggered by 4.1 activity to a server
built with CONFIG_NFSD_PNFS.
There are some more bugfixes queued up that I intend to pass along
next week, but this is the most critical"
* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
Fix commit 9cf514ccfa (nfsd: implement pNFS operations).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Return status after nfsd4_decode_stateid failed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to RFC5661:
" When lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used
to identify the file system and all layouts matching the client ID,
the fsid of the file system, lora_layout_type, and lora_iomode are
returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts
matching the client ID, lora_layout_type, and lora_iomode are
returned and the current filehandle is not used. "
When returning client layouts, always check layout type.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
31ef83dc05 "nfsd: add trace events" had a typo that dropped a trace
event and replaced it by an incorrect recursive call to
nfsd4_cb_layout_fail. 133d558216 "Subject: nfsd: don't recursively
call nfsd4_cb_layout_fail" fixed the crash, this restores the
tracepoint.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull overlayfs fixes from Miklos Szeredi:
"This fixes minor issues with the multi-layer update in v4.0"
* 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: upper fs should not be R/O
ovl: check lowerdir amount for non-upper mount
ovl: print error message for invalid mount options
The misc subsystem (which is used for /dev/fuse) initializes private_data to
point to the misc device when a driver has registered a custom open file
operation, and initializes it to NULL when a custom open file operation has
*not* been provided.
This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc device
structure. And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.
So to simplify things in the misc subsystem, a patch [1] has been proposed to
*always* set the private_data to point to the misc device, instead of only
doing this when a custom open file operation has been registered.
But before this patch can be applied we need to modify drivers that make the
assumption that a misc device file's private_data is initialized to NULL
because they didn't register a custom open file operation, so they don't rely
on this assumption anymore. FUSE uses private_data to store the fuse_conn and
errors out if this is not initialized to NULL at mount time.
Hence, we now set a file's private_data to NULL explicitly, to be independent
of whatever value the misc subsystem initializes it to by default.
[1] https://lkml.org/lkml/2014/12/4/939
Reported-by: Giedrius Statkevicius <giedriuswork@gmail.com>
Reported-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
After importing multi-lower layer support, users could mount a r/o
partition as the left most lowerdir instead of using it as upperdir.
And a r/o upperdir may cause an error like
overlayfs: failed to create directory ./workdir/work
during mount.
This patch check the *s_flags* of upper fs and return an error if
it is a r/o partition. The checking of *upper_mnt->mnt_sb->s_flags*
can be removed now.
This patch also remove
/* FIXME: workdir is not needed for a R/O mount */
from ovl_fill_super() because:
1) for upper fs r/o case
Setting a r/o partition as upper is prevented, no need to care about
workdir in this case.
2) for "mount overlay -o ro" with a r/w upper fs case
Users could remount overlayfs to r/w in this case, so workdir should
not be omitted.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Recently multi-lower layer mount support allow upperdir and workdir
to be omitted, then cause overlayfs can be mount with only one
lowerdir directory. This action make no sense and have potential risk.
This patch check the total number of lower directories to prevent
mounting overlayfs with only one directory.
Also, an error message is added to indicate lower directories exceed
OVL_MAX_STACK limit.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Overlayfs should print an error message if an incorrect mount option
is caught like other filesystems.
After this patch, improper option input could be clearly known.
Reported-by: Fabian Sturm <fabian.sturm@aduu.de>
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested. This isn't quite right since we will
be limited to our max extent size. So instead lets do something horrible! Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent. If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly. Thanks,
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
I introduced a regression wrt outstanding_extents accounting. These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time. So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
My fix
Btrfs: fix merge delalloc logic
only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent. We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent. If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation. This shows up with a case like this
[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation. But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations. The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.
This disallows anybody without CAP_SYS_ADMIN to read the pagemap.
[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
[ Eventually we might want to do anything more finegrained, but for now
this is the simple model. - Linus ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Seaborn <mseaborn@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode->dirtied_when is constantly getting
updated. We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.
In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out. We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff. To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Kernfs supports two styles of read: direct_read and seqfile_read.
The latter supports 'poll' correctly thanks to the update of
'->event' in kernfs_seq_show.
The former does not as '->event' is never updated on a read.
So add an appropriate update in kernfs_file_direct_read().
This was noticed because some 'md' sysfs attributes were
recently changed to use direct reads.
Reported-by: Prakash Punnoor <prakash@punnoor.de>
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Fixes: 750f199ee8
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's possible that "fl" won't point at a valid lock at this point, so
use "victim" instead which is either a valid lock or NULL.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Dave could hit this assert consistently running btrfs/078. This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root. For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case. This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty. This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas. We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic. This fixes the regression that was reported by a user on
list. Thanks,
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.
Fsync() can report success when it actually doesn't. When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.
It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return. Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.
This add the necessary check and it now passes the test.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch fixes mips compilation warning:
fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>
With FAN_ONDIR set, the user can end up getting events, which it hasn't
marked. This was revealed with fanotify04 testcase failure on
Linux-4.0-rc1, and is a regression from 3.19, revealed with 66ba93c0d7
("fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask").
# /opt/ltp/testcases/bin/fanotify04
[ ... ]
fanotify04 7 TPASS : event generated properly for type 100000
fanotify04 8 TFAIL : fanotify04.c:147: got unexpected event 30
fanotify04 9 TPASS : No event as expected
The testcase sets the adds the following marks : FAN_OPEN | FAN_ONDIR for
a fanotify on a dir. Then does an open(), followed by close() of the
directory and expects to see an event FAN_OPEN(0x20). However, the
fanotify returns (FAN_OPEN|FAN_CLOSE_NOWRITE(0x10)). This happens due to
the flaw in the check for event_mask in fanotify_should_send_event() which
does:
if (event_mask & marks_mask & ~marks_ignored_mask)
return true;
where, event_mask == (FAN_ONDIR | FAN_CLOSE_NOWRITE),
marks_mask == (FAN_ONDIR | FAN_OPEN),
marks_ignored_mask == 0
Fix this by masking the outgoing events to the user, as we already take
care of FAN_ONDIR and FAN_EVENT_ON_CHILD.
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
during recovery at mount time. The code path that caused the deadlock was
as follows:
nilfs_fill_super()
load_nilfs()
nilfs_salvage_orphan_logs()
* Do roll-forwarding, attach segment constructor for recovery,
and kick it.
nilfs_segctor_thread()
nilfs_segctor_thread_construct()
* A lock is held with nilfs_transaction_lock()
nilfs_segctor_do_construct()
nilfs_segctor_drop_written_files()
iput()
iput_final()
write_inode_now()
writeback_single_inode()
__writeback_single_inode()
do_writepages()
nilfs_writepage()
nilfs_construct_dsync_segment()
nilfs_transaction_lock() --> deadlock
This can happen if commit 7ef3ff2fea ("nilfs2: fix deadlock of segment
constructor over I_SYNC flag") is applied and roll-forward recovery was
performed at mount time. The roll-forward recovery can happen if datasync
write is done and the file system crashes immediately after that. For
instance, we can reproduce the issue with the following steps:
< nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
# dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
# dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
count=1 && reboot -nfh
< the system will immediately reboot >
# mount -t nilfs2 /dev/sdb1 /nilfs
The deadlock occurs because iput() can run segment constructor through
writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags. The
above commit changed segment constructor so that it calls iput()
asynchronously for inodes with i_nlink == 0, but that change was
imperfect.
This fixes the another deadlock by deferring iput() in segment constructor
even for the case that mount is not finished, that is, for the case that
MS_ACTIVE flag is not set.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that making this feature ro_compat isn't quite enough to
prevent accidental corruption on mount from older kernels. Ocfs2 (like
other file systems) will process orphaned inodes even when the user mounts
in 'ro' mode. So for the case of a filesystem not knowing the append_dio
feature, mounting the filesystem could result in orphaned-for-dio files
being deleted, which we clearly don't want.
So instead, turn this into an incompat flag.
Btw, this is kind of my fault - initially I asked that we add a flag to
cover the feature and even suggested that we use an ro flag. It wasn't
until I was looking through our commits for v4.0-rc1 that I realized we
actually want this to be incompat.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"Outside of misc fixes, Filipe has a few fsync corners and we're
pulling in one more of Josef's fixes from production use here"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Btrfs: fix data loss in the fast fsync path
Btrfs: remove extra run_delayed_refs in update_cowonly_root
Btrfs: incremental send, don't rename a directory too soon
btrfs: fix lost return value due to variable shadowing
Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
Btrfs: fix off-by-one logic error in btrfs_realloc_node
Btrfs: add missing inode update when punching hole
Btrfs: abort the transaction if we fail to update the free space cache inode
Btrfs: fix fsync race leading to ordered extent memory leaks
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU+EEgAAoJEAAOaEEZVoIVNtgP/33VZs6/5rthhMhhFt5Gonwn
2wKjrA8Tuskxg2Ij7RLSCnZNeoscF1Te7GNRQdalhO4mHoDzQx3zSHuV/STNlILg
6wNSd0pVj4FjXQN8Kv/slJsKiUu2U33lQwXNiVbgPI4CAdRAVJqYZ/7EIgf7VciQ
FojqR3/RTSW1CHuYcNL5CTkn9Pdm8cwqacXMcKBWK6t1ZiEBqaP0GD0UhpFsUeRF
RPv0ba3a+iuf/ToxhKV1fGWjJUplBUR2FMCnEiUFCnOGNPYeknWKp8wGeRp8bYA0
6Umpk/NLg8mVSzbdW3wxBQ25PRBNVBAmQutFD2aEjIMPXuKPt4IHe/os2SssbfbJ
wyCZ4LEbogcbmOlBAePtlUloy2GM3F7lQN+6GUgiLAT/I0+1VCjESsBftZP7QKRT
N6szoAAsIDNmXcaApYPZZk3JBoyiwHurbrhI23V74p91esVNFqWJaVTpAbn8TkD7
u/hGL7Loi7sXst2g9XISXvcRkcGUKKXf727Ih9wYQx8N6McP1sDgSC9PNYHz2KLo
Ha4pQo+1+t0tXGmTsGpHBouAsSntURjn5/vv8OHbi0G3hS5v06G/j0vkU3lRtMqv
koomIyShwUORLJxo6oJ3yKXQdAnz0Q78LkfDdSyq/M17G1FvOkQhw42a/LJ/A1w1
yzJ5XmwbkM5GcG3LVTLs
=ND3k
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux
Pull file locking fix from Jeff Layton:
"Just a single patch to fix a memory leak that Daniel Wagner discovered
while doing some testing with leases"
* tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux:
locks: fix fasync_struct memory leak in lease upgrade/downgrade handling
Highlights include:
- Fix a regression in the NFSv4 open state recovery code
- Fix a regression in the NFSv4 close code
- Fix regressions and side-effects of the loop-back mounted NFS fixes
in 3.18, that cause the NFS read() syscall to return EBUSY.
- Fix regressions around the readdirplus code and how it interacts with
the VFS lazy unmount changes that went into v3.18.
- Fix issues with out-of-order RPC call replies replacing updated
attributes with stale ones (particularly after a truncate()).
- Fix an underflow checking issue with RPC/RDMA credits
- Fix a number of issues with the NFSv4 delegation return/free code.
- Fix issues around stale NFSv4.1 leases when doing a mount
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU+SJ6AAoJEGcL54qWCgDyVgQQAKNsF/O2O9ip2uAHZRZvM6TS
Ev8c3Spj/FmRI1tlCcGi1zZ8uCSwvPQz3uAN0vOTMocUWjokT5sAgN5yBIxHasem
6YK7jxs9WHiP7MGyReaAFJwG/W6LZndnlNqPWPs9KiaWwKVXIsQ3uFm/Y0lr90Fi
ew16DQm0DUd4Yvv42WJR9ay7UwUPvT7wmaGIVVK2hjQqr2lx02jspt5kfrC+vsMU
OYU/0YDofb2ajs5krbah6tUHf3VDnSVrXP6if66IrukCM9S4AvowpnMJQ5QJALh+
cPlqHDm2ZzuIecpqZEgYLM73wQ2q+KBXTlDLcgYg6LjnqBEivwO9RDn6tfCwKTcS
tCFohQc9iDOj9rTZ9EQlQME6u/FdxWncpxyTsMyBk7FlcLsOQRqio/FXZhbyJGuH
gvIdIW3fPseijtejpYkxgabe6JL9NRvjv3SnOay7xHs9Vn4tRkFF2mkkQDZOG6HT
atkxQp8kB3m9gMoeAmTLdTcJkcFdk6AKnNrcyJaa1GW4msmMuGZtq/6vayKJsBZb
OIw788bDSOVpcVR+6SAC24/dutcl+WJHlSJShqvIrTKejBxPCc7IVSeg63FJsWTO
sxfXdUr3wVJ1ooDFCspeBj+3zAimIq7qDmRRs85ekgEtxrgdUldhg/VwbDjvLjmb
whXFpiCS7Ii0fVYtZvjz
=qMB7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Fix a regression in the NFSv4 open state recovery code
- Fix a regression in the NFSv4 close code
- Fix regressions and side-effects of the loop-back mounted NFS fixes
in 3.18, that cause the NFS read() syscall to return EBUSY.
- Fix regressions around the readdirplus code and how it interacts
with the VFS lazy unmount changes that went into v3.18.
- Fix issues with out-of-order RPC call replies replacing updated
attributes with stale ones (particularly after a truncate()).
- Fix an underflow checking issue with RPC/RDMA credits
- Fix a number of issues with the NFSv4 delegation return/free code.
- Fix issues around stale NFSv4.1 leases when doing a mount"
* tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4.1: Clear the old state by our client id before establishing a new lease
NFSv4: Fix a race in NFSv4.1 server trunking discovery
NFS: Don't write enable new pages while an invalidation is proceeding
NFS: Fix a regression in the read() syscall
NFSv4: Ensure we skip delegations that are already being returned
NFSv4: Pin the superblock while we're returning the delegation
NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
NFSv4: Ensure that we don't reap a delegation that is being returned
NFS: Fix stateid used for NFS v4 closes
NFSv4: Don't call put_rpccred() under the rcu_read_lock()
NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
NFSv3: Use the readdir fileid as the mounted-on-fileid
NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
NFSv4: Set a barrier in the update_changeattr() helper
NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
NFS: Remove size hack in nfs_inode_attrs_need_update()
NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
NFS: Add attribute update barriers to NFS writebacks
NFS: Set an attribute barrier on all updates
NFS: Add attribute update barriers to nfs_setattr_update_inode()
...
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason <clm@fb.com>
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.
Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:
1. fs_info->last_trans_committed == N - 1 and current transaction is
transaction N (fs_info->generation == N);
2. do a buffered write;
3. fsync our inode, this clears our inode's full sync flag, starts
an ordered extent and waits for it to complete - when it completes
at btrfs_finish_ordered_io(), the inode's last_trans is set to the
value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
btrfs_set_inode_last_trans);
4. transaction N is committed, so fs_info->last_trans_committed is now
set to the value N and fs_info->generation remains with the value N;
5. do another buffered write, when this happens btrfs_file_write_iter
sets our inode's last_trans to the value N + 1 (that is
fs_info->generation + 1 == N + 1);
6. transaction N + 1 is started and fs_info->generation now has the
value N + 1;
7. transaction N + 1 is committed, so fs_info->last_trans_committed
is set to the value N + 1;
8. fsync our inode - because it doesn't have the full sync flag set,
we only start the ordered extent, we don't wait for it to complete
(only in a later phase) therefore its last_trans field has the
value N + 1 set previously by btrfs_file_write_iter(), and so we
have:
inode->last_trans <= fs_info->last_trans_committed
(N + 1) (N + 1)
Which made us not log the last buffered write and exit the fsync
handler immediately, returning success (0) to user space and resulting
in data loss after a crash.
This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file 'foo', the one we check for data loss.
# By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
# bit from its flags (btrfs inode specific flags).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
-c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
# Now create one other file and 2 directories. We will move this second file
# from one directory to the other later because it forces btrfs to commit its
# currently open transaction if we fsync the old parent directory. This is
# necessary to trigger the data loss bug that affected btrfs.
mkdir $SCRATCH_MNT/testdir_1
touch $SCRATCH_MNT/testdir_1/bar
mkdir $SCRATCH_MNT/testdir_2
# Make sure everything is durably persisted.
sync
# Write more 8Kb of data to our file.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
# Move our 'bar' file into a new directory.
mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
# Fsync our first directory. Because it had a file moved into some other
# directory, this made btrfs commit the currently open transaction. This is
# a condition necessary to trigger the data loss bug.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
# Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
# data we wrote previously to be persisted and available if a crash happens.
# This did not happen with btrfs, because of the transaction commit that
# happened when we fsynced the parent directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now check that all data we wrote before are available.
echo "File content after log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0040000
Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.
Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:
1. write to file
2. fsync file
3. fsync file
|--> btrfs_inode_in_log() returns true and it set last_trans to 0
4. write to file
|--> btrfs_file_write_iter() no longers sets last_trans, so it
remained with a value of 0
5. fsync
|--> inode->last_trans == 0, so it bails out without logging the
second write
A test case for xfstests will be sent soon.
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>