Sparse reports warnings at ocfs2_refcount_cache_lock()
and ocfs2_refcount_cache_unlock()
warning: context imbalance in ocfs2_refcount_cache_lock()
- wrong count at exit
warning: context imbalance in ocfs2_refcount_cache_unlock()
- unexpected unlock
The root cause is the missing annotation at ocfs2_refcount_cache_lock()
and at ocfs2_refcount_cache_unlock()
Add the missing __acquires(&rf->rf_lock) annotation to
ocfs2_refcount_cache_lock()
Add the missing __releases(&rf->rf_lock) annotation to
ocfs2_refcount_cache_unlock()
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/20200224204130.18178-1-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need 'err' in these 2 places, better to remove them.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1579577836-251879-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct annotation from "l_next_rec" to "l_next_free_rec"
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Link: http://lkml.kernel.org/r/5e76c953-3479-1280-023c-ad05e4c75608@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to log twice in several functions.
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Link: http://lkml.kernel.org/r/77eec86a-f634-5b98-4f7d-0cd15185a37b@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This macro has been unused since it was introduced.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/1579578203-254451-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This macro should be used.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/1579577840-251956-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
O2HB_DEFAULT_BLOCK_BITS/DLM_THREAD_MAX_ASTS/DLM_MIGRATION_RETRY_MS and
OCFS2_MAX_RESV_WINDOW_BITS/OCFS2_MIN_RESV_WINDOW_BITS have been unused
since commit 66effd3c68 ("ocfs2/dlm: Do not migrate resource to a node
that is leaving the domain").
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/1579577827-251796-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This macro is unused since commit ab09203e30 ("sysctl fs: Remove dead
binary sysctl support").
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Link: http://lkml.kernel.org/r/1579577812-251572-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We fall back to lookup+create (instead of atomic_open) in several cases:
1) we don't have write access to filesystem and O_TRUNC is
present in the flags. It's not something we want ->atomic_open() to
see - it just might go ahead and truncate the file. However, we can
pass it the flags sans O_TRUNC - eventually do_open() will call
handle_truncate() anyway.
2) we have O_CREAT | O_EXCL and we can't write to parent.
That's going to be an error, of course, but we want to know _which_
error should that be - might be EEXIST (if file exists), might be
EACCES or EROFS. Simply stripping O_CREAT (and checking if we see
ENOENT) would suffice, if not for O_EXCL. However, we used to have
->atomic_open() fully responsible for rejecting O_CREAT | O_EXCL
on existing file and just stripping O_CREAT would've disarmed
those checks. With nothing downstream to catch the problem -
FMODE_OPENED used to be "don't bother with EEXIST checks,
->atomic_open() has done those". Now EEXIST checks downstream
are skipped only if FMODE_CREATED is set - FMODE_OPENED alone
is not enough. That has eliminated the need to fall back onto
lookup+create path in this case.
3) O_WRONLY or O_RDWR when we have no write access to
filesystem, with nothing else objectionable. Fallback is
(and had always been) pointless.
IOW, we don't really need that fallback; all we need in such
cases is to trim O_TRUNC and O_CREAT properly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
argument had been unused since 1643b43fbd (lookup_open(): lift the
"fallback to !O_CREAT" logics from atomic_open()) back in 2016
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently path_openat() has "EEXIST on O_EXCL|O_CREAT" checks done on one
of the ways out of open_last_lookups(). There are 4 cases:
1) the last component is . or ..; check is not done.
2) we had FMODE_OPENED or FMODE_CREATED set while in lookup_open();
check is not done.
3) symlink to be traversed is found; check is not done (nor
should it be)
4) everything else: check done (before complete_walk(), even).
In case (1) O_EXCL|O_CREAT ends up failing with -EISDIR - that's
open("/tmp/.", O_CREAT|O_EXCL, 0600)
Note that in the same conditions
open("/tmp", O_CREAT|O_EXCL, 0600)
would have yielded EEXIST. Either error is allowed, switching to -EEXIST
in these cases would've been more consistent.
Case (2) is more subtle; first of all, if we have FMODE_CREATED set, the
object hadn't existed prior to the call. The check should not be done in
such a case. The rest is problematic, though - we have
FMODE_OPENED set (i.e. it went through ->atomic_open() and got
successfully opened there)
FMODE_CREATED is *NOT* set
O_CREAT and O_EXCL are both set.
Any such case is a bug - either we failed to set FMODE_CREATED when we
had, in fact, created an object (no such instances in the tree) or
we have opened a pre-existing file despite having had both O_CREAT and
O_EXCL passed. One of those was, in fact caught (and fixed) while
sorting out this mess (gfs2 on cold dcache). And in such situations
we should fail with EEXIST.
Note that for (1) and (4) FMODE_CREATED is not set - for (1) there's nothing
in handle_dots() to set it, for (4) we'd explicitly checked that.
And (1), (2) and (4) are exactly the cases when we leave the loop in
the caller, with do_open() called immediately after that loop. IOW, we
can move the check over there, and make it
If we have O_CREAT|O_EXCL and after successful pathname resolution
FMODE_CREATED is *not* set, we must have run into a preexisting file and
should fail with EEXIST.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
now we can have open_last_lookups() directly from the loop in
path_openat() - the rest of do_last() never returns a symlink
to follow, so we can bloody well leave the loop first.
Rename the rest of that thing from do_last() to do_open() and
make it return an int.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pick_link() needs to push onto stack; we start with using two-element
array embedded into struct nameidata and the first time we need
more than that we switch to separately allocated array.
Allocation can fail, of course, and handling of that would be simple
enough - we need to drop 'link' and bugger off. However, the things
get more complicated in RCU mode. There we must do GFP_ATOMIC
allocation. If that fails, we try to switch to non-RCU mode and
repeat the allocation.
To switch to non-RCU mode we need to grab references to 'link' and
to everything in nameidata. The latter done by unlazy_walk();
the former - legitimize_path(). 'link' must go first - after
unlazy_walk() we are out of RCU-critical period and it's too
late to call legitimize_path() since the references in link->mnt
and link->dentry might be pointing to freed and reused memory.
So we do legitimize_path(), then unlazy_walk(). And that's where
it gets too subtle: what to do if the former fails? We MUST
do path_put(link) to avoid leaks. And we can't do that under
rcu_read_lock(). Solution in mainline was to empty then nameidata
manually, drop out of RCU mode and then do put_path().
In effect, we open-code the things eventual terminate_walk()
would've done on error in RCU mode. That looks badly out of place
and confusing. We could add a comment along the lines of the
explanation above, but... there's a simpler solution. Call
unlazy_walk() even if legitimaze_path() fails. It will take
us out of RCU mode, so we'll be able to do path_put(link).
Yes, it will do unnecessary work - attempt to grab references
on the stuff in nameidata, only to have them dropped as soon
as we return the error to upper layer and get terminate_walk()
called there. So what? We are thoroughly off the fast path
by that point - we had GFP_ATOMIC allocation fail, we had
->d_seq or mount_lock mismatch and we are about to try walking
the same path from scratch in non-RCU mode. Which will need
to do the same allocation, this time with GFP_KERNEL, so it will
be able to apply memory pressure for blocking stuff.
Compared to that the cost of several lockref_get_not_dead()
is noise. And the logics become much easier to understand
that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
step_into() tries to avoid grabbing and dropping mount references
on the steps that do not involve crossing mountpoints (which is
obviously the majority of cases). So it uses a local struct path
with unusual refcounting rules - path.mnt is pinned if and only if
it's not equal to nd->path.mnt.
We used to have similar beasts all over the place and we had quite
a few bugs crop up in their handling - it's easy to get confused
when changing e.g. cleanup on failure exits (or adding a new check,
etc.)
Now that's mostly gone - the step_into() instance (which is what
we need them for) is the only one left. It is exposed to mount
traversal and it's (shortly) seen by pick_link(). Since pick_link()
needs to store it in link stack, where the normal rules apply,
it has to make sure that mount is pinned regardless of nd->path.mnt
value. That's done on all calls of pick_link() and very early
in those. Let's do that in the caller (step_into()) instead -
that way the fewer places need to be aware of such struct path
instances.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only remaining caller (path_pts()) should be using follow_down()
anyway. And clean path_pts() a bit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: choose_mountpoint(). Wrapper around choose_mountpoint_rcu(),
similar to lookup_mnt() vs. __lookup_mnt(). follow_dotdot() switched to
it. Now we don't grab mount_lock exclusive anymore; note that the
primitive used non-RCU mount traversals in other direction (lookup_mnt())
doesn't bother with that either - it uses mount_lock seqcount instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The loops in follow_dotdot{_rcu()} are doing the same thing:
we have a mount and we want to find out how far up the chain
of mounts do we need to go.
We follow the chain of mount until we find one that is not
directly overmounting the root of another mount. If such
a mount is found, we want the location it's mounted upon.
If we run out of chain (i.e. get to a mount that is not
mounted on anything else) or run into process' root, we
report failure.
On success, we want (in RCU case) d_seq of resulting location
sampled or (in non-RCU case) references to that location
acquired.
This commit introduces such primitive for RCU case and
switches follow_dotdot_rcu() to it; non-RCU case will be
go in the next commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change nd->path only after the loop is done and only in case we hadn't
ended up finding ourselves in root. Same for NO_XDEV check.
That separates the "check how far back do we need to go through the
mount stack" logics from the rest of .. traversal.
NOTE: path_get/path_put introduced here are temporary. They will
go away later in the series.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change nd->path only after the loop is done and only in case we hadn't
ended up finding ourselves in root. Same for NO_XDEV check. Don't
recheck mount_lock on each step either.
That separates the "check how far back do we need to go through the
mount stack" logics from the rest of .. traversal.
Note that the sequence for d_seq/d_inode here is
* sample mount_lock seqcount
...
* sample d_seq
* fetch d_inode
* verify mount_lock seqcount
The last step makes sure that d_inode value we'd got matches d_seq -
it dentry is guaranteed to have been a mountpoint through the
entire thing, so its d_inode must have been stable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The logics in both of them is the same:
while true
if in process' root // uncommon
break
if *not* in mount root // normal case
find the parent
return
if at absolute root // very uncommon
break
move to underlying mountpoint
report that we are in root
Pull the common path out of the loop:
if in process' root // uncommon
goto in_root
if unlikely(in mount root)
while true
if at absolute root
goto in_root
move to underlying mountpoint
if in process' root
goto in_root
if in mount root
break;
find the parent // we are not in mount root
return
in_root:
report that we are in root
The reason for that transformation is that we get to keep the
common path straight *and* get a separate block for "move
through underlying mountpoints", which will allow to sanitize
NO_XDEV handling there. What's more, the pared-down loops
will be easier to deal with - in particular, non-RCU case
has no need to grab mount_lock and rewriting it to the
form that wouldn't do that is a non-trivial change. Better
do that with less stuff getting in the way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lift step_into() into handle_dots() (where they merge with each other);
have follow_... return dentry and pass inode/seq to the caller.
[braino fix folded; kudos to Qian Cai <cai@lca.pw> for reporting it]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull crypto updates from Herbert Xu:
"API:
- Fix out-of-sync IVs in self-test for IPsec AEAD algorithms
Algorithms:
- Use formally verified implementation of x86/curve25519
Drivers:
- Enhance hwrng support in caam
- Use crypto_engine for skcipher/aead/rsa/hash in caam
- Add Xilinx AES driver
- Add uacce driver
- Register zip engine to uacce in hisilicon
- Add support for OCTEON TX CPT engine in marvell"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
crypto: af_alg - bool type cosmetics
crypto: arm[64]/poly1305 - add artifact to .gitignore files
crypto: caam - limit single JD RNG output to maximum of 16 bytes
crypto: caam - enable prediction resistance in HRWNG
bus: fsl-mc: add api to retrieve mc version
crypto: caam - invalidate entropy register during RNG initialization
crypto: caam - check if RNG job failed
crypto: caam - simplify RNG implementation
crypto: caam - drop global context pointer and init_done
crypto: caam - use struct hwrng's .init for initialization
crypto: caam - allocate RNG instantiation descriptor with GFP_DMA
crypto: ccree - remove duplicated include from cc_aead.c
crypto: chelsio - remove set but not used variable 'adap'
crypto: marvell - enable OcteonTX cpt options for build
crypto: marvell - add the Virtual Function driver for CPT
crypto: marvell - add support for OCTEON TX CPT engine
crypto: marvell - create common Kconfig and Makefile for Marvell
crypto: arm/neon - memzero_explicit aes-cbc key
crypto: bcm - Use scnprintf() for avoiding potential buffer overflow
crypto: atmel-i2c - Fix wakeup fail
...
Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)
Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Refactor nfs_lock_and_join_requests() in order to separate out the
subrequest merging into its own function nfs_lock_and_join_group()
that can be used by O_DIRECT.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we have to split the request up into subrequests, we have to submit
the request pointed to by the function call parameter last, in case
there is an error or other issue that causes us to exit before the
last request is submitted. The reason is that the caller is expected
to perform cleanup in those cases.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up nfs_lock_and_join_requests() to simplify the calculation
of the range covered by the page group, taking into account the
presence of mirrors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to trust that desc->pg_mirror_idx is set correctly, whether
or not mirroring is enabled.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we just set the mirror count to 1 without first clearing out
the mirrors, we can leak queued up requests.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_direct_write_scan_commit_list() will lock the request and bump
the reference count, but we also need to account for the reference
that was taken when we initially added the request to the commit list.
Fixes: fb5f7f20cd ("NFS: commit errors should be fatal")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to ensure that we create the mirror requests before calling
nfs_pageio_add_request_mirror() on the request we are adding.
Otherwise, we can end up with a use-after-free if the call to
nfs_pageio_add_request_mirror() triggers I/O.
Fixes: c917cfaf9b ("NFS: Fix up NFS I/O subrequest creation")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When a subrequest is being detached from the subgroup, we want to
ensure that it is not holding the group lock, or in the process
of waiting for the group lock.
Fixes: 5b2b5187fa ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.
The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days. Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.
Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions. Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value. So with very lucky timing after this change this still
remains expoiltable.
I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.
Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When we detach a subrequest from the list, we must also release the
reference it holds to the parent.
Fixes: 5b2b5187fa ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull networking updates from David Miller:
"Highlights:
1) Fix the iwlwifi regression, from Johannes Berg.
2) Support BSS coloring and 802.11 encapsulation offloading in
hardware, from John Crispin.
3) Fix some potential Spectre issues in qtnfmac, from Sergey
Matyukevich.
4) Add TTL decrement action to openvswitch, from Matteo Croce.
5) Allow paralleization through flow_action setup by not taking the
RTNL mutex, from Vlad Buslov.
6) A lot of zero-length array to flexible-array conversions, from
Gustavo A. R. Silva.
7) Align XDP statistics names across several drivers for consistency,
from Lorenzo Bianconi.
8) Add various pieces of infrastructure for offloading conntrack, and
make use of it in mlx5 driver, from Paul Blakey.
9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.
10) Lots of parallelization improvements during configuration changes
in mlxsw driver, from Ido Schimmel.
11) Add support to devlink for generic packet traps, which report
packets dropped during ACL processing. And use them in mlxsw
driver. From Jiri Pirko.
12) Support bcmgenet on ACPI, from Jeremy Linton.
13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
Starovoitov, and your's truly.
14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.
15) Fix sysfs permissions when network devices change namespaces, from
Christian Brauner.
16) Add a flags element to ethtool_ops so that drivers can more simply
indicate which coalescing parameters they actually support, and
therefore the generic layer can validate the user's ethtool
request. Use this in all drivers, from Jakub Kicinski.
17) Offload FIFO qdisc in mlxsw, from Petr Machata.
18) Support UDP sockets in sockmap, from Lorenz Bauer.
19) Fix stretch ACK bugs in several TCP congestion control modules,
from Pengcheng Yang.
20) Support virtual functiosn in octeontx2 driver, from Tomasz
Duszynski.
21) Add region operations for devlink and use it in ice driver to dump
NVM contents, from Jacob Keller.
22) Add support for hw offload of MACSEC, from Antoine Tenart.
23) Add support for BPF programs that can be attached to LSM hooks,
from KP Singh.
24) Support for multiple paths, path managers, and counters in MPTCP.
From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
and others.
25) More progress on adding the netlink interface to ethtool, from
Michal Kubecek"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
cxgb4/chcr: nic-tls stats in ethtool
net: dsa: fix oops while probing Marvell DSA switches
net/bpfilter: remove superfluous testing message
net: macb: Fix handling of fixed-link node
net: dsa: ksz: Select KSZ protocol tag
netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
net: stmmac: create dwmac-intel.c to contain all Intel platform
net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
net: dsa: bcm_sf2: Add support for matching VLAN TCI
net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
net: dsa: bcm_sf2: Disable learning for ASP port
net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
net: dsa: b53: Restore VLAN entries upon (re)configuration
net: dsa: bcm_sf2: Fix overflow checks
hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl6Ch6wUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPdcg/9FDMS/n0Xl1HQBUyu26EwLu3aUpNE
BdghXW1LKSTp7MrOENE60PGzZSAiC07ci1DqFd7zfLPZf2q5IwPwOBa/Avy8z95V
oHKqcMT6WO1SPOm/PxZn16FCKyET4gZDTXvHBAyiyFsbk36R522ZY615P9T3eLu/
ZA1NFsSjj68SqMCUlAWfeqjcbQiX63bryEpugOIg0qWy7R/+rtWxj9TjriZ+v9tq
uC45UcjBqphpmoPG8BifA3jjyclwO3DeQb5u7E8//HPPraGeB19ntsymUg7ljoGk
NrqCkZtv6E+FRCDTR5f0O7M1T4BWJodxw2NwngnTwKByLC25EZaGx80o+VyMt0eT
Pog+++JZaa5zZr2OYOtdlPVMLc2ALL6p/8lHOqFU3GKfIf04hWOm6/Lb2IWoXs3f
CG2b6vzoXYyjbF0Q7kxadb8uBY2S1Ds+CVu2HMBBsXsPdwbbtFWOT/6aRAQu61qn
PW+f47NR8By3SO6nMzWts2SZEERZNIEdSKeXHuR7As1jFMXrHLItreb4GCSPay5h
2bzRpxt2br5CDLh7Jv2pZnHtUqBWOSbslTix77+Z/hPKaNowvD9v3tc5hX87rDmB
dYXROD6/KoyXFYDcMdphlvORFhqGqd5bEYuHHum/VjSIRh237+/nxFY/vZ4i4bzU
2fvpAmUlVX1c4rw=
=LlWA
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux updates from Paul Moore:
"We've got twenty SELinux patches for the v5.7 merge window, the
highlights are below:
- Deprecate setting /sys/fs/selinux/checkreqprot to 1.
This flag was originally created to deal with legacy userspace and
the READ_IMPLIES_EXEC personality flag. We changed the default from
1 to 0 back in Linux v4.4 and now we are taking the next step of
deprecating it, at some point in the future we will take the final
step of rejecting 1.
- Allow kernfs symlinks to inherit the SELinux label of the parent
directory. In order to preserve backwards compatibility this is
protected by the genfs_seclabel_symlinks SELinux policy capability.
- Optimize how we store filename transitions in the kernel, resulting
in some significant improvements to policy load times.
- Do a better job calculating our internal hash table sizes which
resulted in additional policy load improvements and likely general
SELinux performance improvements as well.
- Remove the unused initial SIDs (labels) and improve how we handle
initial SIDs.
- Enable per-file labeling for the bpf filesystem.
- Ensure that we properly label NFS v4.2 filesystems to avoid a
temporary unlabeled condition.
- Add some missing XFS quota command types to the SELinux quota
access controls.
- Fix a problem where we were not updating the seq_file position
index correctly in selinuxfs.
- We consolidate some duplicated code into helper functions.
- A number of list to array conversions.
- Update Stephen Smalley's email address in MAINTAINERS"
* tag 'selinux-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: clean up indentation issue with assignment statement
NFS: Ensure security label is set for root inode
MAINTAINERS: Update my email address
selinux: avtab_init() and cond_policydb_init() return void
selinux: clean up error path in policydb_init()
selinux: remove unused initial SIDs and improve handling
selinux: reduce the use of hard-coded hash sizes
selinux: Add xfs quota command types
selinux: optimize storage of filename transitions
selinux: factor out loop body from filename_trans_read()
security: selinux: allow per-file labeling for bpffs
selinux: generalize evaluate_cond_node()
selinux: convert cond_expr to array
selinux: convert cond_av_list to array
selinux: convert cond_list to array
selinux: sel_avc_get_stat_idx should increase position index
selinux: allow kernfs symlinks to inherit parent directory context
selinux: simplify evaluate_cond_node()
Documentation,selinux: deprecate setting checkreqprot to 1
selinux: move status variables out of selinux_ss
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl6DXLwACgkQiiy9cAdy
T1GqXQwAiJJpu3nBTtZeY9ZcybrpQnLve8H3Y/v/dmxJuu8hXcoEcpGPyzx+etlT
7X7pb1lwfw1n/1p2LGzRxigZUkG86QQ+Qe2D87elA2DtJ3zagIbQg/Jq/nrIK7/U
DE+2IJUGh/Q8LS9gXwv95k4+P3iTM1GYoJHmDS+Hnp2EJ+PABBc55ZUe12+wpHYx
EE58pkKe7uOc8+F+I8ySprJNgGsh4MT4hpWLIGXCDSROFBnYbBwN/xERKIJwh2zX
y6WCWQb18FvoyxqHNTbVz+NayPslAu64GdY8L85Ke/eslguFDcklAb0BNhGe86bH
3l0rM4ghWkHLxG44lAA2QO2ljoUJKUH7/HzKEJ6camm0fg2EUDO04No+k0Mmj6Ye
qCi1d7fSbSyPS0ctNICCZnjhCRwDtIiEvQ4hghh1m18ZNuipduSu2tMeRl60DnKp
ToAJBTzZMuItRPxZcWQCsihpkzFvG3dCsSL2J2P9esiwp+fXC66difCND6mfBT05
FQedw4H0
=8fDc
-----END PGP SIGNATURE-----
Merge tag '5.7-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"First part of cifs/smb3 changes for merge window (others are still
being tested). Various RDMA (smbdirect) fixes, addition of SMB3.1.1
POSIX support in readdir, 3 fixes for stable, and a fix for flock.
Summary:
New feature:
- SMB3.1.1 POSIX support in readdir
Fixes:
- various RDMA (smbdirect) fixes
- fix for flock
- fallocate fix
- some improved mount warnings
- two timestamp related fixes
- reconnect fix
- three fixes for stable"
* tag '5.7-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (28 commits)
cifs: update internal module version number
cifs: Allocate encryption header through kmalloc
cifs: smbd: Check and extend sender credits in interrupt context
cifs: smbd: Calculate the correct maximum packet size for segmented SMBDirect send/receive
smb3: use SMB2_SIGNATURE_SIZE define
CIFS: Fix bug which the return value by asynchronous read is error
CIFS: check new file size when extending file by fallocate
SMB3: Minor cleanup of protocol definitions
SMB3: Additional compression structures
SMB3: Add new compression flags
cifs: smb2pdu.h: Replace zero-length array with flexible-array member
cifs: clear PF_MEMALLOC before exiting demultiplex thread
cifs: cifspdu.h: Replace zero-length array with flexible-array member
CIFS: Warn less noisily on default mount
fs/cifs: fix gcc warning in sid_to_id
cifs: allow unlock flock and OFD lock across fork
cifs: do d_move in rename
cifs: add SMB2_open() arg to return POSIX data
cifs: plumb smb2 POSIX dir enumeration
cifs: add smb2 POSIX info level
...
are related to corruption that occurs when journals are replayed.
For example:
1. A node fails while writing to the file system.
2. Other nodes use the metadata that was once used by the failed node.
3. When the node returns to the cluster, its journal is replayed,
but the older metadata blocks overwrite the changes from step 2.
- Fixed the recovery sequence to prevent corruption during journal replay.
- Many bug fixes found during recovery testing.
- New improved file system withdraw sequence.
- Fixed how resource group buffers are managed.
- Fixed how metadata revokes are tracked and written.
- Improve processing of IO errors hit by daemons like logd and quotad.
- Improved error checking in metadata writes.
- Fixed how qadata quota data structures are managed.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE89F0ZrnZapxy/9qS14th09/3ejsFAl6Db/QACgkQ14th09/3
ejvVTgf+IdHXfmpv3ftah8lDDpbsnSKZYRC1NW7skQB+NVG9KtJhtzy1nldaMqMv
s8wQ5aGKrfBfmzg8IZ9Pt3dCItFqC5d8IqcO0M0FtNuyN+27ETUUMnqBf1NwL6wI
iAm/+ncZ/BiZN2P8MgXV3OgRGvaC9ebmz860+nthwyJT+6y8d8Qab7pUfyix5e0d
oTgDhEJqF0DOrGsrlS5rxjTU+RMixtepsAW958D4Eks28OlyduRAj6fAMDoLN2/E
WoDpX6iKeczH0lOZxnIVQOkCztDaa0jDlK2JK7sJRBMpNxj77aUn4cffY+b/A4kk
sR5gjsiHoesdAMEpHIXSdEcYMIstIg==
=VEKB
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've got a lot of patches (39) for this merge window. Most of these
patches are related to corruption that occurs when journals are
replayed. For example:
1. A node fails while writing to the file system.
2. Other nodes use the metadata that was once used by the failed
node.
3. When the node returns to the cluster, its journal is replayed, but
the older metadata blocks overwrite the changes from step 2.
Summary:
- Fixed the recovery sequence to prevent corruption during journal
replay.
- Many bug fixes found during recovery testing.
- New improved file system withdraw sequence.
- Fixed how resource group buffers are managed.
- Fixed how metadata revokes are tracked and written.
- Improve processing of IO errors hit by daemons like logd and
quotad.
- Improved error checking in metadata writes.
- Fixed how qadata quota data structures are managed"
* tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (39 commits)
gfs2: Fix oversight in gfs2_ail1_flush
gfs2: change from write to read lock for sd_log_flush_lock in journal replay
gfs2: instrumentation wrt ail1 stuck
gfs2: don't lock sd_log_flush_lock in try_rgrp_unlink
gfs2: Remove unnecessary gfs2_qa_{get,put} pairs
gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put
gfs2: Change inode qa_data to allow multiple users
gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc
gfs2: Switch to list_{first,last}_entry
gfs2: Clean up inode initialization and teardown
gfs2: Additional information when gfs2_ail1_flush withdraws
gfs2: leaf_dealloc needs to allocate one more revoke
gfs2: allow journal replay to hold sd_log_flush_lock
gfs2: don't allow releasepage to free bd still used for revokes
gfs2: flesh out delayed withdraw for gfs2_log_flush
gfs2: Do proper error checking for go_sync family of glops functions
gfs2: Don't demote a glock until its revokes are written
gfs2: drain the ail2 list after io errors
gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages fails
gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl6CDIMACgkQxWXV+ddt
WDuJ9g/+NTVt+OXAX3G4VLAIR6EjugREAmiHPlojM7scKsmkBuH9BN35+2EPj+yS
rSmdL01nOH3gyqe+RzAc1EEiujH/9uDpkNf4zE1tGtj9m5Useqj8ZNmiG/BN0PmR
OJZkVb8DXUHEXIFscHjQJPP60kFZoqIovS7qZbDh4992+p98lTiUUEI6SPanVYeR
QysXxmafty03hQMFW93ohFZemwAELVVI44nHxxcmOHT5BbIIopXrkInkkchB9I6b
l+tIJx1gjL6k0D3v/TTqRuD+wGCE8InJgtiuEOf0WkHp2YXUlSDaKAnF/j9Le4oe
eOgc50LtA3YNGmZ2m5vTeRjBeU9qUPWjJWJ2urp87oIrxX5x7B5Hsjxdnn28P0yZ
dl/dt9HxeCKFgaRrMZYETYq9VBt0IMxiOIG9w5fukB9qnC6Dd05dXyQB0slg0+l1
chn5p0FtMS74cvXB32jW7N0fwxWNt6KI4zBvomabJGYZQd6+dyDO8l8Od86vvve/
w7KgRy7CFBjc9JOCyLTvS8eEhu/qAVc07phSblpdNnyzPFjWWTdZySON/qQYvUCf
cGDiq+5+1d1+kWuEjtYNzvxon2AaAfg7UBZm5FrjN735ojTQXqm2vi3rrurcU5AZ
ItmiU6DMre5EGZ+hfWgSPXDkeqx/JYbtDuUwWbNg6svTXaKKnmI=
=1m9l
-----END PGP SIGNATURE-----
Merge tag 'for-5.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"A number of core changes that make things work better in general, code
is simpler and cleaner.
Core changes:
- per-inode file extent tree, for in memory tracking of contiguous
extent ranges to make sure i_size adjustments are accurate
- tree root structures are protected by reference counts, replacing
SRCU that did not cover some cases
- leak detector for tree root structures
- per-transaction pinned extent tracking
- buffer heads are replaced by bios for super block access
- speedup of extent back reference resolution, on an example test
scenario the runtime of send went down from a hour to minutes
- factor out locking scheme used for subvolume writer and NOCOW
exclusion, abstracted as DREW lock, double reader-writer exclusion
(allow either readers or writers)
- cleanup and abstract extent allocation policies, preparation for
zoned device support
- make reflink/clone_range work on inline extents
- add more cancellation point for relocation, improves long response
from 'balance cancel'
- add page migration callback for data pages
- switch to guid for uuids, with additional cleanups of the interface
- make ranged full fsyncs more efficient
- removal of obsolete ioctl flag BTRFS_SUBVOL_CREATE_ASYNC
- remove b-tree readahead from delayed refs paths, avoiding seek and
read unnecessary blocks
Features:
- v2 of ioctl to delete subvolumes, allowing to delete by id and more
future extensions
Fixes:
- fix qgroup rescan worker that could block umount
- fix crash during unmount due to race with delayed inode workers
- fix dellaloc flushing logic that could create unnecessary chunks
under heavy load
- fix missing file extent item for hole after ranged fsync
- several fixes in relocation error handling
Other:
- more documentation of relocation, device replace, space
reservations
- many random cleanups"
* tag 'for-5.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (210 commits)
btrfs: fix missing semaphore unlock in btrfs_sync_file
btrfs: use nofs allocations for running delayed items
btrfs: sysfs: Use scnprintf() instead of snprintf()
btrfs: do not resolve backrefs for roots that are being deleted
btrfs: track reloc roots based on their commit root bytenr
btrfs: restart relocate_tree_blocks properly
btrfs: reloc: reorder reservation before root selection
btrfs: do not readahead in build_backref_tree
btrfs: do not use readahead for running delayed refs
btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot
btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid
btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support
btrfs: kill the subvol_srcu
btrfs: make btrfs_cleanup_fs_roots use the radix tree lock
btrfs: don't take an extra root ref at allocation time
btrfs: hold a ref on the root on the dead roots list
btrfs: make inodes hold a ref on their roots
btrfs: move the root freeing stuff into btrfs_put_root
btrfs: move ino_cache_inode dropping out of btrfs_free_fs_root
btrfs: make the extent buffer leak check per fs info
...
Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves a file's
encryption nonce. This makes it easier to write automated tests which
verify that fscrypt is doing the encryption correctly.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXoIg/RQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK2mZAQDjEil0Kf8AqZhjPuJSRrbifkzEPfu+
4EmERSyBZ5OCLgEA155kKnL5jiz7b5DRS9wGEw+drGpW8I7WfhTGv/XjoQs=
=2jU9
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves a file's
encryption nonce.
This makes it easier to write automated tests which verify that
fscrypt is doing the encryption correctly"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
ubifs: wire up FS_IOC_GET_ENCRYPTION_NONCE
f2fs: wire up FS_IOC_GET_ENCRYPTION_NONCE
ext4: wire up FS_IOC_GET_ENCRYPTION_NONCE
fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl
Add below two callback interfaces in struct f2fs_compress_ops:
int (*init_decompress_ctx)(struct decompress_io_ctx *dic);
void (*destroy_decompress_ctx)(struct decompress_io_ctx *dic);
Which will be used by zstd compress algorithm later.
In addition, this patch adds callback function pointer check, so that
specified algorithm can avoid defining unneeded functions.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use LZ4 as default compression algorithm, as compared to LZO, it shows
almost the same compression ratio and much better decompression speed.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
{cic,dic}.ref should be initialized to number of compressed pages,
let's initialize it directly rather than doing w/
f2fs_set_compressed_page().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Multipage read flow should consider fsverity, so it needs to use
f2fs_readpage_limit() instead of i_size_read() to check EOF condition.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix gcc warnings:
In file included from fs/f2fs/dir.c:15:0:
fs/f2fs/xattr.h:157:13: warning: 'f2fs_destroy_xattr_caches' defined but not used [-Wunused-function]
static void f2fs_destroy_xattr_caches(struct f2fs_sb_info *sbi) { }
^~~~~~~~~~~~~~~~~~~~~~~~~
fs/f2fs/xattr.h:156:12: warning: 'f2fs_init_xattr_caches' defined but not used [-Wunused-function]
static int f2fs_init_xattr_caches(struct f2fs_sb_info *sbi) { return 0; }
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: a999150f4f ("f2fs: use kmem_cache pool during inline xattr lookups")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
On image that has verity and compression feature, if compressed pages
and non-compressed pages are mixed in one bio, we may double unlock
non-compressed page in below flow:
- f2fs_post_read_work
- f2fs_decompress_work
- f2fs_decompress_bio
- __read_end_io
- unlock_page
- fsverity_enqueue_verify_work
- f2fs_verity_work
- f2fs_verify_bio
- unlock_page
So it should skip handling non-compressed page in f2fs_decompress_work()
if verity is on.
Besides, add missing dec_page_count() in f2fs_verify_bio().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_inode_info.flags is unsigned long variable, it has 32 bits
in 32bit architecture, since we introduced FI_MMAP_FILE flag
when we support data compression, we may access memory cross
the border of .flags field, corrupting .i_sem field, result in
below deadlock.
To fix this issue, let's expand .flags as an array to grab enough
space to store new flags.
Call Trace:
__schedule+0x8d0/0x13fc
? mark_held_locks+0xac/0x100
schedule+0xcc/0x260
rwsem_down_write_slowpath+0x3ab/0x65d
down_write+0xc7/0xe0
f2fs_drop_nlink+0x3d/0x600 [f2fs]
f2fs_delete_inline_entry+0x300/0x440 [f2fs]
f2fs_delete_entry+0x3a1/0x7f0 [f2fs]
f2fs_unlink+0x500/0x790 [f2fs]
vfs_unlink+0x211/0x490
do_unlinkat+0x483/0x520
sys_unlink+0x4a/0x70
do_fast_syscall_32+0x12b/0x683
entry_SYSENTER_32+0xaa/0x102
Fixes: 4c8ff7095b ("f2fs: support data compression")
Tested-by: Ondrej Jirman <megous@megous.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If both compression and fsverity feature is on, generic/572 will
report below NULL pointer dereference bug.
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:f2fs_verity_work+0x60/0x90 [f2fs]
#PF: supervisor read access in kernel mode
Workqueue: fsverity_read_queue f2fs_verity_work [f2fs]
RIP: 0010:f2fs_verity_work+0x60/0x90 [f2fs]
Call Trace:
process_one_work+0x16c/0x3f0
worker_thread+0x4c/0x440
? rescuer_thread+0x350/0x350
kthread+0xf8/0x130
? kthread_unpark+0x70/0x70
ret_from_fork+0x35/0x40
There are two issue in f2fs_verity_work():
- it needs to traverse and verify all pages in bio.
- if pages in bio belong to non-compressed cluster, accessing
decompress IO context stored in page private will cause NULL
pointer dereference.
Fix them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_decompress_end_io(), we should clear PG_error flag before page
unlock, otherwise reread will fail due to the flag as described in
commit fb7d70db30 ("f2fs: clear PageError on the read path").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_tmpfile(), parent inode's encryption info is only used when
inheriting encryption context to its child inode, however, we have
already called fscrypt_get_encryption_info() in fscrypt_inherit_context()
to get the encryption info, so just removing unneeded one in
f2fs_tmpfile().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Data flush can generate heavy IO and cause long latency during
flush, so it's not appropriate to trigger it in foreground
operation.
And also, we may face below potential deadlock during data flush:
- f2fs_write_multi_pages
- f2fs_write_raw_pages
- f2fs_write_single_data_page
- f2fs_balance_fs
- f2fs_balance_fs_bg
- f2fs_sync_dirty_inodes
- filemap_fdatawrite -- stuck on flush same cluster
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merge below two conditions into f2fs_may_encrypt() for cleanup
- IS_ENCRYPTED()
- DUMMY_ENCRYPTION_ENABLED()
Check IS_ENCRYPTED(inode) condition in f2fs_init_inode_metadata()
is enough since we have already set encrypt flag in f2fs_new_inode().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should always check F2FS_I(inode)->cp_task condition in prior to other
conditions in __should_serialize_io() to avoid deadloop described in
commit 040d2bb318 ("f2fs: fix to avoid deadloop if data_flush is on"),
however we break this rule when we support compression, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This lock can be a contention with multi 4k random read IO with single inode.
example) fio --output=test --name=test --numjobs=60 --filename=/media/samsung960pro/file_test --rw=randread --bs=4k
--direct=1 --time_based --runtime=7 --ioengine=libaio --iodepth=256 --group_reporting --size=10G
With this commit, it remove that possible lock contention.
Signed-off-by: Dongjoo Seo <commisori28@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When using NFSv4.2, the security label for the root inode should be set
via a call to nfs_setsecurity() during the mount process, otherwise the
inode will appear as unlabeled for up to acdirmin seconds. Currently
the label for the root inode is allocated, retrieved, and freed entirely
witin nfs4_proc_get_root().
Add a field for the label to the nfs_fattr struct, and allocate & free
the label in nfs_get_root(), where we also add a call to
nfs_setsecurity(). Note that for the call to nfs_setsecurity() to
succeed, it's necessary to also move the logic calling
security_sb_{set,clone}_security() from nfs_get_tree_common() down into
nfs_get_root()... otherwise the SBLABEL_MNT flag will not be set in the
super_block's security flags and nfs_setsecurity() will silently fail.
Reported-by: Richard Haines <richard_c_haines@btinternet.com>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Tested-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: fixed 80-char line width problems]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued user-access cleanups in the futex code.
- percpu-rwsem rewrite that uses its own waitqueue and atomic_t
instead of an embedded rwsem. This addresses a couple of
weaknesses, but the primary motivation was complications on the -rt
kernel.
- Introduce raw lock nesting detection on lockdep
(CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
lock differences. This too originates from -rt.
- Reuse lockdep zapped chain_hlocks entries, to conserve RAM
footprint on distro-ish kernels running into the "BUG:
MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
chain-entries pool.
- Misc cleanups, smaller fixes and enhancements - see the changelog
for details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
Documentation/locking/locktypes: Minor copy editor fixes
Documentation/locking/locktypes: Further clarifications and wordsmithing
m68knommu: Remove mm.h include from uaccess_no.h
x86: get rid of user_atomic_cmpxchg_inatomic()
generic arch_futex_atomic_op_inuser() doesn't need access_ok()
x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
objtool: whitelist __sanitizer_cov_trace_switch()
[parisc, s390, sparc64] no need for access_ok() in futex handling
sh: no need of access_ok() in arch_futex_atomic_op_inuser()
futex: arch_futex_atomic_op_inuser() calling conventions change
completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
lockdep: Add posixtimer context tracing bits
lockdep: Annotate irq_work
lockdep: Add hrtimer context tracing bits
lockdep: Introduce wait-type checks
completion: Use simple wait queues
sched/swait: Prepare usage in completions
...
Pull EFI updates from Ingo Molnar:
"The EFI changes in this cycle are much larger than usual, for two
(positive) reasons:
- The GRUB project is showing signs of life again, resulting in the
introduction of the generic Linux/UEFI boot protocol, instead of
x86 specific hacks which are increasingly difficult to maintain.
There's hope that all future extensions will now go through that
boot protocol.
- Preparatory work for RISC-V EFI support.
The main changes are:
- Boot time GDT handling changes
- Simplify handling of EFI properties table on arm64
- Generic EFI stub cleanups, to improve command line handling, file
I/O, memory allocation, etc.
- Introduce a generic initrd loading method based on calling back
into the firmware, instead of relying on the x86 EFI handover
protocol or device tree.
- Introduce a mixed mode boot method that does not rely on the x86
EFI handover protocol either, and could potentially be adopted by
other architectures (if another one ever surfaces where one
execution mode is a superset of another)
- Clean up the contents of 'struct efi', and move out everything that
doesn't need to be stored there.
- Incorporate support for UEFI spec v2.8A changes that permit
firmware implementations to return EFI_UNSUPPORTED from UEFI
runtime services at OS runtime, and expose a mask of which ones are
supported or unsupported via a configuration table.
- Partial fix for the lack of by-VA cache maintenance in the
decompressor on 32-bit ARM.
- Changes to load device firmware from EFI boot service memory
regions
- Various documentation updates and minor code cleanups and fixes"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
efi/libstub/arm: Fix spurious message that an initrd was loaded
efi/libstub/arm64: Avoid image_base value from efi_loaded_image
partitions/efi: Fix partition name parsing in GUID partition entry
efi/x86: Fix cast of image argument
efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations
efi: Fix a mistype in comments mentioning efivar_entry_iter_begin()
efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux
efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map()
efi/x86: Ignore the memory attributes table on i386
efi/x86: Don't relocate the kernel unless necessary
efi/x86: Remove extra headroom for setup block
efi/x86: Add kernel preferred address to PE header
efi/x86: Decompress at start of PE image load address
x86/boot/compressed/32: Save the output address instead of recalculating it
efi/libstub/x86: Deal with exit() boot service returning
x86/boot: Use unsigned comparison for addresses
efi/x86: Avoid using code32_start
efi/x86: Make efi32_pe_entry() more readable
efi/x86: Respect 32-bit ABI in efi32_pe_entry()
efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA
...
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- Make kfree_rcu() use kfree_bulk() for added performance
- RCU updates
- Callback-overload handling updates
- Tasks-RCU KCSAN and sparse updates
- Locking torture test and RCU torture test updates
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Make rcu_barrier() account for offline no-CBs CPUs
rcu: Mark rcu_state.gp_seq to detect concurrent writes
Documentation/memory-barriers: Fix typos
doc: Add rcutorture scripting to torture.txt
doc/RCU/rcu: Use https instead of http if possible
doc/RCU/rcu: Use absolute paths for non-rst files
doc/RCU/rcu: Use ':ref:' for links to other docs
doc/RCU/listRCU: Update example function name
doc/RCU/listRCU: Fix typos in a example code snippets
doc/RCU/Design: Remove remaining HTML tags in ReST files
doc: Add some more RCU list patterns in the kernel
rcutorture: Set KCSAN Kconfig options to detect more data races
rcutorture: Manually clean up after rcu_barrier() failure
rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU
rcuperf: Measure memory footprint during kfree_rcu() test
rcutorture: Annotation lockless accesses to rcu_torture_current
rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch
rcutorture: Fix stray access to rcu_fwd_cb_nodelay
rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race
rcutorture: Make kvm-find-errors.sh abort on bad directory
...
In “ubifs_check_node”, when the value of "node_len" is abnormal,
the code will goto label of "out_len" for execution. Then, in the
following "ubifs_dump_node", if inode type is "UBIFS_DATA_NODE",
in "print_hex_dump", an out-of-bounds access may occur due to the
wrong "ch->len".
Therefore, when the value of "node_len" is abnormal, data length
should to be adjusted to a reasonable safe range. At this time,
structured data is not credible, so dump the corrupted data directly
for analysis.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
Memory leak occurs when files with extended attributes are added to
orphan list.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Fixes: 988bec4131 ("ubifs: orphan: Handle xattrs like files")
Signed-off-by: Richard Weinberger <richard@nod.at>
When inodes with extended attributes are evicted, xent is not freed in one
exit branch.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Fixes: 9ca2d73264 ("ubifs: Limit number of xattrs per inode")
Signed-off-by: Richard Weinberger <richard@nod.at>
Here is the "big" set of driver core changes for 5.7-rc1.
Nothing huge in here, just lots of little firmware core changes and use
of new apis, a libfs fix, a debugfs api change, and some driver core
deferred probe rework.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXoHLIg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yle2ACgjJJzRJl9Ckae3ms+9CS4OSFFZPsAoKSrXmFc
Z7goYQdZo1zz8c0RYDrJ
=Y91m
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core changes for 5.7-rc1.
Nothing huge in here, just lots of little firmware core changes and
use of new apis, a libfs fix, a debugfs api change, and some driver
core deferred probe rework.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (44 commits)
Revert "driver core: Set fw_devlink to "permissive" behavior by default"
driver core: Set fw_devlink to "permissive" behavior by default
driver core: Replace open-coded list_last_entry()
driver core: Read atomic counter once in driver_probe_done()
libfs: fix infoleak in simple_attr_read()
driver core: Add device links from fwnode only for the primary device
platform/x86: touchscreen_dmi: Add info for the Chuwi Vi8 Plus tablet
platform/x86: touchscreen_dmi: Add EFI embedded firmware info support
Input: icn8505 - Switch to firmware_request_platform for retreiving the fw
Input: silead - Switch to firmware_request_platform for retreiving the fw
selftests: firmware: Add firmware_request_platform tests
test_firmware: add support for firmware_request_platform
firmware: Add new platform fallback mechanism and firmware_request_platform()
Revert "drivers: base: power: wakeup.c: Use built-in RCU list checking"
drivers: base: power: wakeup.c: Use built-in RCU list checking
component: allow missing unbind callback
debugfs: remove return value of debugfs_create_file_size()
debugfs: Check module state before warning in {full/open}_proxy_open()
firmware: fix a double abort case with fw_load_sysfs_fallback
arch_topology: Fix putting invalid cpu clk
...
Orphans are allowed to point to deleted inodes.
So -ENOENT is not a fatal error.
Reported-by: Кочетков Максим <fido_max@inbox.ru>
Reported-and-tested-by: "Christian Berger" <Christian.Berger@de.bosch.com>
Tested-by: Karl Olsen <karl@micro-technic.com>
Tested-by: Jef Driesen <jef.driesen@niko.eu>
Fixes: ee1438ce5d ("ubifs: Check link count of inodes when killing orphans.")
Signed-off-by: Richard Weinberger <richard@nod.at>
- Improve failure paths (chenqiwu)
- Fix ftrace position index (Vasily Averin)
- Use proper flexible-array member (Gustavo A. R. Silva)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl6Bc00WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJvUOEACudEIGbwcMkBOtWOlEDGMzMx1q
4NMOuewTCAUYVw7JqAuXhilVbATbxX3r46wah3qoF786/94tNTL//43WuhjSHL+W
6aBpL42vm4i3NNPw9nHNU6bSDiXsdQRLg0pTrQHQDUzlpR63PrgoVUiyxwS5Hoaq
Php2qyLT4gtWi6zMxFJtLAuzJ5ye24odr4jep/BdGifUY5NfMXPbnhqij/3YTLTV
B0XjgCCn3a/WyXnV9iKBacHVbp9qMHfY9BHKvIhmaOFR7Ef6TOW/Q1QgO2mOMJnY
mD8w9Usz9+DGxXzZJRPHTn2Pd0kelMONIq5Tbt0va617KgWiyAxMlXZHiu+ZSMWj
rI4piMUP1aP2+bCm0ST9FpP8lMoDmI/Wl2GtgUwxdbdMF+tbLFMLd8Y+xLIu0WSR
TCkzCtnM/3zU4dPeOoptiIxWYyyoy2RXEThmeOBnibOZkNstcaVY03rogahW/H6Z
m97cKMnMoVrEVFSEZCzwHuWTaJ6LTUT4lXz2X8VN3Yro994qcQut0R1IChyCA0t6
SgzvspbzBxHoIPxw03Ef82D5fAiWgTsdQ2lfkUy2j6zeJS5S9o4kuAjTBFDnnDfv
kqOTDy0CxzDp0nB2x6cjSxVCxxOxZUVXglQj8X0hLIi/smu3zTxBGZSoZ8BA+6xt
mY5bx6opEG/Oteg73Q==
=g7H5
-----END PGP SIGNATURE-----
Merge tag 'pstore-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
"These mostly some minor cleanups and a bug fix for an ftrace corner
case:
- Improve failure paths (chenqiwu)
- Fix ftrace position index (Vasily Averin)
- Use proper flexible-array member (Gustavo A. R. Silva)"
* tag 'pstore-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/ram: Replace zero-length array with flexible-array member
pstore: pstore_ftrace_seq_next should increase position index
pstore/ram: remove unnecessary ramoops_unregister_dummy()
pstore/platform: fix potential mem leak if pstore_init_fs failed
- Convert radix tree usage to XArray;
- Fix shrink scan count on multiple filesystem instances;
- Better handling for specific corrupted images;
- Update my email address in MAINTAINERS.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXoFRvBYcZ2FveGlhbmcy
NUBodWF3ZWkuY29tAAoJEDk3MdwfteYEswMBAMtsyo6TqWPToKt/eAJMbvt5vRGf
y4XGEx67a1Ds7/LqAQCtOs+0HMWlK3F2DDljpA7Tg2QvRBJwFlhET6YZOAIcDQ==
=XsWD
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"Updates with a XArray adaptation, several fixes for shrinker and
corrupted images are ready for this cycle.
All commits have been stress tested with no noticeable smoke out and
have been in linux-next as well.
Summary:
- Convert radix tree usage to XArray
- Fix shrink scan count on multiple filesystem instances
- Better handling for specific corrupted images
- Update my email address in MAINTAINERS"
* tag 'erofs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
MAINTAINERS: erofs: update my email address
erofs: handle corrupted images whose decompressed size less than it'd be
erofs: use LZ4_decompress_safe() for full decoding
erofs: correct the remaining shrink objects
erofs: convert workstn to XArray
- Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
Maybe someday we'll get to the end of this stuff...maybe...
- Some organizational work to bring some order to the core-api manual.
- Various new docs and additions to the existing documentation.
- Typo fixes, warning fixes, ...
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl6BLf4PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YLhkIAIhcg6gxp0oZZ3KDfQyhvej0EWQGVDNkmloQ
O1VOSV3RJsZL9HwN9xSNnNfN5+hw5RUYVbn1s201uj6kovZY9qcTpHP2LCizUeGb
eFkSTmzkyAuAbJjuVLgMPDerJPEew0HnudiToeSpQeoIL1WB6YGd4/5H/cN1KLex
8ggjllcY0wOgbiFffmK6+tavDv7vT0lKTdwKRYh2nxu7zrPVVd1ZnW+RtntdTVQt
i+xwV6/YdWtg5C53IwBPpeyubX40vqaIjU8rzpLq5SCVbsZN14sSR709m1AYCOK0
i4VDWEhfA2XBi6Nycl5U0czuGziwoHrTgSCkS1mmSDujnpgfKM8=
=6YOS
-----END PGP SIGNATURE-----
Merge tag 'docs-5.7' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"This has been a busy cycle for documentation work.
Highlights include:
- Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
Maybe someday we'll get to the end of this stuff...maybe...
- Some organizational work to bring some order to the core-api
manual.
- Various new docs and additions to the existing documentation.
- Typo fixes, warning fixes, ..."
* tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits)
Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT
MAINTAINERS: adjust to filesystem doc ReST conversion
docs: deprecated.rst: Add BUG()-family
doc: zh_CN: add translation for virtiofs
doc: zh_CN: index files in filesystems subdirectory
docs: locking: Drop :c:func: throughout
docs: locking: Add 'need' to hardirq section
docs: conf.py: avoid thousands of duplicate label warning on Sphinx
docs: prevent warnings due to autosectionlabel
docs: fix reference to core-api/namespaces.rst
docs: fix pointers to io-mapping.rst and io_ordering.rst files
Documentation: Better document the softlockup_panic sysctl
docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref
docs: perf: imx-ddr.rst: get rid of a warning
docs: filesystems: fuse.rst: supress a Sphinx warning
docs: translations: it: avoid duplicate refs at programming-language.rst
docs: driver.rst: supress two ReSt warnings
docs: trace: events.rst: convert some new stuff to ReST format
Documentation: Add io_ordering.rst to driver-api manual
Documentation: Add io-mapping.rst to driver-api manual
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
xVbzeR9UIw==
=bYLJ
-----END PGP SIGNATURE-----
Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Here are the io_uring changes for this merge window. Light on new
features this time around (just splice + buffer selection), lots of
cleanups, fixes, and improvements to existing support. In particular,
this contains:
- Cleanup fixed file update handling for stack fallback (Hillf)
- Re-work of how pollable async IO is handled, we no longer require
thread offload to handle that. Instead we rely using poll to drive
this, with task_work execution.
- In conjunction with the above, allow expendable buffer selection,
so that poll+recv (for example) no longer has to be a split
operation.
- Make sure we honor RLIMIT_FSIZE for buffered writes
- Add support for splice (Pavel)
- Linked work inheritance fixes and optimizations (Pavel)
- Async work fixes and cleanups (Pavel)
- Improve io-wq locking (Pavel)
- Hashed link write improvements (Pavel)
- SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"
* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
io_uring: cleanup io_alloc_async_ctx()
io_uring: fix missing 'return' in comment
io-wq: handle hashed writes in chains
io-uring: drop 'free_pfile' in struct io_file_put
io-uring: drop completion when removing file
io_uring: Fix ->data corruption on re-enqueue
io-wq: close cancel gap for hashed linked work
io_uring: make spdxcheck.py happy
io_uring: honor original task RLIMIT_FSIZE
io-wq: hash dependent work
io-wq: split hashing and enqueueing
io-wq: don't resched if there is no work
io-wq: remove duplicated cancel code
io_uring: fix truncated async read/readv and write/writev retry
io_uring: dual license io_uring.h uapi header
io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
io_uring: Fix unused function warnings
io_uring: add end-of-bits marker and build time verify it
io_uring: provide means of removing buffers
io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJCoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvziEACqQC+QRKiqR6X5yaPWJ9LqjKE7lfI1PUb7
0a1z1mKuf8d6z0qNleUwdSOEaS5zJiswou2K8GLvEtTQH41QYsQkxc9GLjAyTveK
szAyzZaa3BNUy9hkczm9i2arv3fI8XoTE3JvRM0e9wL8fBJDYCtKtHFJvF4hisOQ
ydaJlU6tcwzd9bdV7K5dLwBxu3AeAJjzS3Tyfw25u9N9O/btUxJ91RTqBb2+Xeoz
AVasfRlAqf/CzdjxCCmDgWE2QM4852pAeQ7UJJBGISNWNoiwkezMg+6HD0jEOLee
bQ8uDyQdihIWTY+/zQasotX8/71uLV8QgtjWLXR9zrjrubIBWHGzoWSQ4kPg5DfQ
bJmKO0VvWN2sshZEpWvzzAFGYxZViNphbK2Pb4hKOcv7jtMcC8mmEogh/7EqbD/n
KB3IM9qVoXM8INm5o0dTy5uDRJxiHiHYkqsZaKz55BB/R4Geym5TINT3nXgxhQrn
JoSwp4zdm3/NJOySruDi2eETqWJC2bsz3FsQSyCQTPOuP0nLtFKBb1UKHpmYTCXG
H4LCyCKFJ6s006qBcdaNPZBw1mrSNwoxEulHnpYA4BFfPeXi72yrnMZQkdwWONpW
LIVuD0hBm8X/pulbvEEdjzXBqZVkqK3xFX+uX5+bnwwaUKddXAC/h9SQKpBP2Mbb
AeZToMklKw==
=6Glq
-----END PGP SIGNATURE-----
Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Online capacity resizing (Balbir)
- Number of hardware queue change fixes (Bart)
- null_blk fault injection addition (Bart)
- Cleanup of queue allocation, unifying the node/no-node API
(Christoph)
- Cleanup of genhd, moving code to where it makes sense (Christoph)
- Cleanup of the partition handling code (Christoph)
- disk stat fixes/improvements (Konstantin)
- BFQ improvements (Paolo)
- Various fixes and improvements
* tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits)
block: return NULL in blk_alloc_queue() on error
block: move bio_map_* to blk-map.c
Revert "blkdev: check for valid request queue before issuing flush"
block: simplify queue allocation
bcache: pass the make_request methods to blk_queue_make_request
null_blk: use blk_mq_init_queue_data
block: add a blk_mq_init_queue_data helper
block: move the ->devnode callback to struct block_device_operations
block: move the part_stat* helpers from genhd.h to a new header
block: move block layer internals out of include/linux/genhd.h
block: move guard_bio_eod to bio.c
block: unexport get_gendisk
block: unexport disk_map_sector_rcu
block: unexport disk_get_part
block: mark part_in_flight and part_in_flight_rw static
block: mark block_depr static
block: factor out requeue handling from dispatch code
block/diskstats: replace time_in_queue with sum of request times
block/diskstats: accumulate all per-cpu counters in one pass
block/diskstats: more accurate approximation of io_ticks for slow disks
...
Ordinarily, function gfs2_ail1_start_one issues a write request
for one item on the ail1 list, then returns -EBUSY. This makes the
caller, gfs2_ail1_flush, loop around and start another. However,
it was not clearing the -EBUSY return code each time through the loop.
So on rare occasions, like when the wbc runs out of nr_to_write, it
remained set to -EBUSY, which triggered an error and withdraw.
This patch sets the return code to 0 each time through the restart
loop so this won't happen anymore.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Clang warns:
fs/notify/fanotify/fanotify.c:28:23: warning: self-comparison always
evaluates to true [-Wtautological-compare]
return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1];
^
fs/notify/fanotify/fanotify.c:28:57: warning: self-comparison always
evaluates to true [-Wtautological-compare]
return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1];
^
2 warnings generated.
The intention was clearly to compare val[0] and val[1] in the two
different fsid structs. Fix it otherwise this function always returns
true.
Fixes: afc894c784 ("fanotify: Store fanotify handles differently")
Link: https://github.com/ClangBuiltLinux/linux/issues/952
Link: https://lore.kernel.org/r/20200327171030.30625-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When encryption is used, smb2_transform_hdr is defined on the stack and is
passed to the transport. This doesn't work with RDMA as the buffer needs to
be DMA'ed.
Fix it by using kmalloc.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a RDMA packet is received and server is extending send credits, we should
check and unblock senders immediately in IRQ context. Doing it in a worker
queue causes unnecessary delay and doesn't save much CPU on the receive path.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The packet size needs to take account of SMB2 header size and possible
encryption header size. This is only done when signing is used and it is for
RDMA send/receive, not read/write.
Also remove the dead SMBD code in smb2_negotiate_r(w)size.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Improve readability and maintainability by replacing a hardcoded string
allocation and formatting by the use of the kasprintf() helper.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
ext4_fill_super doublechecks the number of groups before mounting; if
that check fails, the resulting error message prints the group count
from the ext4_sb_info sbi, which hasn't been set yet. Print the freshly
computed group count instead (which at that point has just been computed
in "blocks_count").
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Fixes: 4ec1102813 ("ext4: Add sanity checks for the superblock before mounting the filesystem")
Link: https://lore.kernel.org/r/8b957cd1513fcc4550fe675c10bcce2175c33a49.1585431964.git.josh@joshtriplett.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If ext4_fill_super detects an invalid number of inodes per group, the
resulting error message printed the number of blocks per group, rather
than the number of inodes per group. Fix it to print the correct value.
Fixes: cd6bb35bf7 ("ext4: use more strict checks for inodes_per_block on mount")
Link: https://lore.kernel.org/r/8be03355983a08e5d4eed480944613454d7e2550.1585434649.git.josh@joshtriplett.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently on calling echo 3 > drop_caches on host machine, we see
FS corruption in the guest. This happens on Power machine where
blocksize < pagesize.
So as a temporary workaound don't enable dioread_nolock by default
for blocksize < pagesize until we identify the root cause.
Also emit a warning msg in case if this mount option is manually
enabled for blocksize < pagesize.
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200327200744.12473-1-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the FLUSH_SYNC flag is set, nfs_initiate_pgio() will currently
wait for completion, and then return the status of the I/O operation.
What we actually want to report in nfs_pageio_doio() is whether or
not the RPC call was launched successfully, whereas actual I/O
status is intended handled in the reply callbacks.
Since FLUSH_SYNC is never set by any of the callers anyway, let's
just remove that code altogether.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
disable preemption, which is undesired for latency reasons and breaks when
regular spinlocks are taken within the bit_spinlock locked region because
regular spinlocks are converted to 'sleeping spinlocks' on RT.
PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this
problem. The replacement was done conditionaly at compile time, but
Christoph requested to do an unconditional conversion.
Jan suggested to move the spinlock into a existing padding hole which
avoids a size increase of struct buffer_head on production kernels.
As a benefit the lock gains lockdep coverage.
[ bigeasy: Remove the wrapper and use always spinlock_t and move it into
the padding hole ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de
Move from requesting only full file layout segments, to requesting
layout segments that match our I/O size. This means the server is
still free to return a full file layout, but we will no longer
error out if it does not.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When starting to read or write with a layout segment, check that the
range matches our request.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Check that the number of mirrors, and the mirror information matches
before deciding to merge layout segments in pNFS/flexfiles.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up pnfs_layout_mark_request_commit() to alway reschedule the write
if the layout segment is invalid. Also minor cleanup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move the pNFS commit related operations into a separate structure
that can be carried by the pnfs_ds_commit_info.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Lift filelayout_search_commit_reqs() into the generic pnfs/nfs code,
and add support for commit arrays.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Enable adding and lookup of per-layout segment commits in filelayout
and flexfilelayout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that both the file and flexfiles layout types clean up when
freeing the layout segments.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to nfs_clear_pnfs_ds_commit_verifiers().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Instead of trying to save the commit verifiers and checking them against
previous writes, adopt the same strategy as for buffered writes, of
just checking the verifiers at commit time.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a pNFS callback to allow the O_DIRECT code to release the DS
commitinfo when freeing the dreq.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_commit_pagelist().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_recover_commit_reqs().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_scan_commit_lists()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we have multiple layout segments with different lists of mirrored
data, we need to track the commits on a per layout segment basis.
This patch adds a list to support this tracking in struct
pnfs_ds_commit_info.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Function gfs2_recover_func grabs the sd_log_flush_lock rw_semaphore in
write mode. This is unnecessary because we only need to prevent log flush
from using sd_log_bio bio while it does. Therefore, a read lock will be
enough. This is a small step in cleaning up log flush.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, if the ail1 flush got stuck for some reason, there
were no clues as to why. This patch introduces a check for getting
stuck for more than a minute, and if it happens, it dumps the items
still remaining on the ail1 list.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In function try_rgrp_unlink, we added a temporary lock of the
sd_log_flush_lock while searching the bitmaps. This protected us from
problems in which dinodes being freed were still in a state of flux
because the rgrp was in an active transaction. It was a kludge.
Now that we've straightened out the code for inode eviction, deletes,
and all the recovery mess, we no longer need this kludge.
This patch removes it, and should improve performance.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
We now get the quota data structure when opening a file writable and put it
when closing that writable file descriptor, so there no longer is a need for
gfs2_qa_{get,put} while we're holding a writable file descriptor.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Keeping reservations and quotas separate helps reviewing the code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, multiple users called gfs2_qa_alloc which allocated
a qadata structure to the inode, if quotas are turned on. Later, in
file close or evict, the structure was deleted with gfs2_qa_delete.
But there can be several competing processes who need access to the
structure. There were races between file close (release) and the others.
Thus, a release could delete the structure out from under a process
that relied upon its existence. For example, chown.
This patch changes the management of the qadata structures to be
a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get
and if the structure is allocated, the count essentially starts out at
1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the
last guy to decrement the count to 0 frees the memory.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, multiple callers called gfs2_rsqa_alloc to force
the existence of a reservations structure and a quota data structure
if needed. However, now the reservations are handled separately, so
the quota data is only the quota data. So we eliminate the one in
favor of just calling gfs2_qa_alloc directly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Replace open-coded versions of list_first_entry and list_last_entry with those
functions.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When allocating a new inode, mark the iopen glock holder as uninitialized to
make sure gfs2_evict_inode won't fail after an incomplete create or lookup. In
gfs2_evict_inode, allow the inode glock to be NULL and remove the duplicate
iopen glock teardown code. In gfs2_inode_lookup, don't tear down things that
gfs2_evict_inode will already tear down.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
It clarifies the code slightly to use SMB2_SIGNATURE_SIZE
define rather than 16.
Suggested-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Cleanup io_alloc_async_ctx() a bit, add a new __io_alloc_async_ctx(),
so io_setup_async_rw() won't need to check whether async_ctx is true
or false again.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A proper way to handle O_NONBLOCK would be making the requests and
responses happen asynchronously, but this would require serious code
refactoring.
Link: http://lkml.kernel.org/r/20200205003457.24340-2-l29ah@cock.li
Signed-off-by: Sergey Alirzaev <l29ah@cock.li>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Link: http://lkml.kernel.org/r/20191120134340.16770-1-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
When it's probing all of a fileserver's interfaces to find which one is
best to use, afs_do_probe_fileserver() takes a lock on the server record
and notes the pointer to the address list.
It doesn't, however, pin the address list, so as soon as it drops the
lock, there's nothing to stop the address list from being freed under
us.
Fix this by taking a ref on the address list inside the locked section
and dropping it at the end of the function.
Fixes: 3bf0fb6f33 ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
leak fixes, marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl59DCwTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5oGB/943a7gIBV52PD3MGCnI8RWjgHkk3d0
en2JNI6i7hf7GD7GplMGkc0D8INBJhCZo1mwzX36QXYA3BeXKARkNXvEE+AZ4dX5
XbUiPE5WuUwxcT9sE9rTiCurx1ToN/XUlA27Vbt9J67U08w5BjJ3utO1LuW7z2ME
NPx6aw6tdwIEeNJBo4ge8y9vPKevtXqhkCbzSb2kn+tMhoMPuJ3RIj8kWIF7mYWZ
ofwOFoDnOfQuH+9ZA/mT4jL7ifR0am5QptHSD9kxge2mKlc0pmoABZK6sWNPOslg
jQaEiefH77K/IxRyAsQNM7iHbUzKpZGbqAHx92MU0redUjUWNdCDGUmF
=c01Y
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch for a rather old regression in fullness handling and two
memory leak fixes, marked for stable"
* tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client:
ceph: fix memory leak in ceph_cleanup_snapid_map()
libceph: fix alloc_msg_with_page_vector() memory leaks
ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
I noticed that fsfreeze can take a very long time to freeze an XFS if
there happens to be a GETFSMAP caller running in the background. I also
happened to notice the following in dmesg:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
Call Trace:
xfs_fs_freeze+0x25/0x40 [xfs]
freeze_super+0xc8/0x180
do_vfs_ioctl+0x70b/0x750
? __fget_files+0x135/0x210
ksys_ioctl+0x3a/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
These two things appear to be related. The assertion trips when another
thread initiates a fsmap request (which uses an empty transaction) after
the freezer waited for m_active_trans to hit zero but before the the
freezer executes the WARN_ON just prior to calling xfs_log_quiesce.
The lengthy delays in freezing happen because the freezer calls
xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the
GETFSMAP caller is continuing to grab and release buffers, which means
that it can take a very long time for the buffer lru list to empty out.
We fix both of these races by calling sb_start_write to obtain freeze
protection while using empty transactions for GETFSMAP and for metadata
scrubbing. The other two users occur during mount, during which time we
cannot fs freeze.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the bio_add_page() call fails, we proceed to write out a
partially constructed log buffer. This corrupts the physical log
such that log recovery is not possible. Worse, persistent
occurrences of this error eventually lead to a BUG_ON() failure in
bio_split() as iclogs wrap the end of the physical log, which
triggers log recovery on subsequent mount.
Rather than warn about writing out a corrupted log buffer, shutdown
the fs as is done for any log I/O related error. This preserves the
consistency of the physical log such that log recovery succeeds on a
subsequent mount. Note that this was observed on a 64k page debug
kernel without upstream commit 59bb47985c ("mm, sl[aou]b:
guarantee natural alignment for kmalloc(power-of-two)"), which
demonstrated frequent iclog bio overflows due to unaligned (slab
allocated) iclog data buffers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we're checking bestfree information in directory blocks, always
drop the block buffer at the end of the function. We should always
release resources when we're done using them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The dirattr btree checking code uses the altpath substructure of the
dirattr state structure to check the sibling pointers of dir/attr tree
blocks. At the end of sibling checks, xfs_da3_path_shift could have
changed multiple levels of buffer pointers in the altpath structure.
Although we release the leaf level buffer, this isn't enough -- we also
need to release the node buffers that are unique to the altpath.
Not releasing all of the altpath buffers leaves them locked to the
transaction. This is suboptimal because we should release resources
when we don't need them anymore. Fix the function to loop all levels of
the altpath, and fix the return logic so that we always run the loop.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When quotacheck runs, it zeroes all the timer fields in every dquot.
Unfortunately, it also does this to the root dquot, which erases any
preconfigured grace intervals and warning limits that the administrator
may have set. Worse yet, the incore copies of those variables remain
set. This cache coherence problem manifests itself as the grace
interval mysteriously being reset back to the defaults at the /next/
mount.
Fix it by not resetting the root disk dquot's timer and warning fields.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The patch "ext4: make dioread_nolock the default" (244adf6426) causes
generic/422 to fail when run in kvm-xfstests' ext3conv test case. This
applies both the dioread_nolock and nodelalloc mount options, a
combination not previously tested by kvm-xfstests. The failure occurs
because the dioread_nolock code path splits a previously fallocated
multiblock extent into a series of single block extents when overwriting
a portion of that extent. That causes allocation of an extent tree leaf
node and a reshuffling of extents. Once writeback is completed, the
individual extents are recombined into a single extent, the extent is
moved again, and the leaf node is deleted. The difference in block
utilization before and after writeback due to the leaf node triggers the
failure.
The original reason for this behavior was to avoid ENOSPC when handling
I/O completions during writeback in the dioread_nolock code paths when
delayed allocation is disabled. It may no longer be necessary, because
code was added in the past to reserve extra space to solve this problem
when delayed allocation is enabled, and this code may also apply when
delayed allocation is disabled. Until this can be verified, don't use
the dioread_nolock code paths if delayed allocation is disabled.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200319150028.24592-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Under some circumstances we may encounter a filesystem error on a
read-only block device, and if we try to save the error info to the
superblock and commit it, we'll wind up with a noisy error and
backtrace, i.e.:
[ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode #
------------[ cut here ]------------
generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2)
WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0
...
To avoid this, commit the error info in the superblock only if the
block device is writable.
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/4b6e774d-cc00-3469-7abb-108eb151071a@sandeen.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4 is running on a filesystem without a journal, it tries not to
reuse recently deleted inodes to provide better chances for filesystem
recovery in case of crash. However this logic forbids reuse of freed
inodes for up to 5 minutes and especially for filesystems with smaller
number of inodes can lead to ENOSPC errors returned when allocating new
inodes.
Fix the problem by allowing to reuse recently deleted inode if there's
no other inode free in the scanned range.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200318121317.31941-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Refactor pnfs_generic_commit_pagelist() to simplify the conversion
to layout segment based commit lists.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Just allocate the array at the end of the layout segment structure,
instead of allocating it as a separate array of pointers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Overlapping header include additions in macsec.c
A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c
Overlapping test additions in selftests Makefile
Overlapping PCI ID table adjustments in iwlwifi driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
For FAN_DIR_MODIFY event, allocate a variable size event struct to store
the dir entry name along side the directory file handle.
At this point, name info reporting is not yet implemented, so trying to
set FAN_DIR_MODIFY in mark mask will return -EINVAL.
Link: https://lore.kernel.org/r/20200319151022.31456-14-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull networking fixes from David Miller:
1) Fix deadlock in bpf_send_signal() from Yonghong Song.
2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan.
3) Add missing locking in iwlwifi mvm code, from Avraham Stern.
4) Fix MSG_WAITALL handling in rxrpc, from David Howells.
5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong
Wang.
6) Fix producer race condition in AF_PACKET, from Willem de Bruijn.
7) cls_route removes the wrong filter during change operations, from
Cong Wang.
8) Reject unrecognized request flags in ethtool netlink code, from
Michal Kubecek.
9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from
Doug Berger.
10) Don't leak ct zone template in act_ct during replace, from Paul
Blakey.
11) Fix flushing of offloaded netfilter flowtable flows, also from Paul
Blakey.
12) Fix throughput drop during tx backpressure in cxgb4, from Rahul
Lakkireddy.
13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric
Dumazet.
14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well,
also from Eric Dumazet.
15) Restrict macsec to ethernet devices, from Willem de Bruijn.
16) Fix reference leak in some ethtool *_SET handlers, from Michal
Kubecek.
17) Fix accidental disabling of MSI for some r8169 chips, from Heiner
Kallweit.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
net: ena: Add PCI shutdown handler to allow safe kexec
selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED
selftests/net: add missing tests to Makefile
r8169: re-enable MSI on RTL8168c
net: phy: mdio-bcm-unimac: Fix clock handling
cxgb4/ptp: pass the sign of offset delta in FW CMD
net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
net: cbs: Fix software cbs to consider packet sending time
net/mlx5e: Do not recover from a non-fatal syndrome
net/mlx5e: Fix ICOSQ recovery flow with Striding RQ
net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset
net/mlx5e: Enhance ICOSQ WQE info fields
net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure
selftests: netfilter: add nfqueue test case
netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
netfilter: nft_fwd_netdev: validate family and chain type
netfilter: nft_set_rbtree: Detect partial overlaps on insertion
netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion
...
A single fix in this pull request to correctly handle the size of
read-only zone files (from me).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXnrJJAAKCRDdoc3SxdoY
dgZXAQDK88T4sdtFq1Fl1PuP+oyzHml+xgNo0djZQOdicnD6qQD8CgMGDFQQG4dv
Ral+67qEyvUABGt0Vkmy29wuN8El6wQ=
=+1D9
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
"A single fix from me to correctly handle the size of read-only zone
files"
* tag 'zonefs-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonfs: Fix handling of read-only zones
These macros are just used by a few files. Move them out of genhd.h,
which is included everywhere into a new standalone header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is bio layer functionality and not related to buffer heads.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ordered ops are started twice in sync file, once outside of inode mutex
and once inside, taking the dio semaphore. There was one error path
missing the semaphore unlock.
Fixes: aab15e8ec2 ("Btrfs: fix rare chances for data loss when doing a fast fsync")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>