Commit Graph

71646 Commits

Author SHA1 Message Date
Linus Torvalds 0cc2ea8ceb Some highlights:
- add tracepoints for callbacks and for client creation and
 	  destruction
 	- cache the mounts used for server-to-server copies
 	- expose callback information in /proc/fs/nfsd/clients/*/info
 	- don't hold locks unnecessarily while waiting for commits
 	- update NLM to use xdr_stream, as we have for NFSv2/v3/v4
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmDlvjIVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+0MoP/RJ8Q7zwIz6WFHn3bCRaEXpnnkAH
 mmMfELhmgvH0V5nXWbb2rAfhllY+/zeWtf8QHSEKUPCnVLmB7WeXKdjXSy7EnYJ8
 R8DuuuII85McIrg93nJ8hxm4wXTaTZKXpS4Vxkuxc6YKxoeJoXOaTjbgRLIw8mfX
 w4wPfjAsnROboVxvDHUmBS9zNKaAi2dZ0jH2x2eS7eZSWzoJC30yd+pFSxyYoOac
 3fZUntDskQDGIpXHuTf53WcaK7h1bUHrwS7Joez8Z0ctg4vcbJsfdhKZUZwAxOZh
 3xWAgm3PFcze5xqHuX8BYBThHfB3uTeygZQRb3zI9sG2UQtQfundrtlxZRSjMMkC
 cwlSi2SQNL66EBIgOcS3U/9OeorLALnnRax1KWMWjpFzaBJJQTJDumwLRx4zogI1
 Ouiu0fI+hApck+L+qCzJMidA2wxOBsDzH471YiGiqQSmgNZc6wBc+aC/JKN8QAWb
 jG53vvpa3gCZa8Rs3KyOoUvtcCCdiQc+nljbzqtVfIvvGa9MSixufa+U5fojLEO7
 i8aangK+mteMxrrejEKvRu1efDIfpFq0HW7ev1mzW2Jl/AguDXM5XUeGK2mMMPtc
 WqT3arbtGVcXJN+Oh5TzTVuED/DecyO0Fig77G+WJTiWONgoHfs+E5nC4aHSpohn
 bMpmQMIOmTa5zgQP
 =BQyR
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:

 - add tracepoints for callbacks and for client creation and destruction

 - cache the mounts used for server-to-server copies

 - expose callback information in /proc/fs/nfsd/clients/*/info

 - don't hold locks unnecessarily while waiting for commits

 - update NLM to use xdr_stream, as we have for NFSv2/v3/v4

* tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux: (69 commits)
  nfsd: fix NULL dereference in nfs3svc_encode_getaclres
  NFSD: Prevent a possible oops in the nfs_dirent() tracepoint
  nfsd: remove redundant assignment to pointer 'this'
  nfsd: Reduce contention for the nfsd_file nf_rwsem
  lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream
  lockd: Update the NLMv4 nlm_res results encoder to use struct xdr_stream
  lockd: Update the NLMv4 TEST results encoder to use struct xdr_stream
  lockd: Update the NLMv4 void results encoder to use struct xdr_stream
  lockd: Update the NLMv4 FREE_ALL arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 SHARE arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 SM_NOTIFY arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 nlm_res arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 UNLOCK arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 CANCEL arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 LOCK arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 TEST arguments decoder to use struct xdr_stream
  lockd: Update the NLMv4 void arguments decoder to use struct xdr_stream
  lockd: Update the NLMv1 SHARE results encoder to use struct xdr_stream
  lockd: Update the NLMv1 nlm_res results encoder to use struct xdr_stream
  lockd: Update the NLMv1 TEST results encoder to use struct xdr_stream
  ...
2021-07-07 12:50:08 -07:00
Pavel Begunkov c32aace0cf io_uring: fix drain alloc fail return code
After a recent change io_drain_req() started to fail requests with
result=0 in case of allocation failure, where it should be and have
been -ENOMEM.

Fixes: 76cc33d791 ("io_uring: refactor io_req_defer()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e068110ac4293e0c56cfc4d280d0f22b9303ec08.1625682153.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-07 12:49:32 -06:00
Steve French 2a780e8b64 CIFS: Clarify SMB1 code for delete
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 SetFileDisposition (which is used to
unlink a file by setting the delete on close flag).  This
changeset doesn't change the address but makes it slightly
clearer.

Addresses-Coverity: 711524 ("Out of bounds write")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07 11:53:17 -05:00
Steve French e3973ea3a7 CIFS: Clarify SMB1 code for SetFileSize
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the header
structure rather than from the beginning of the struct plus
4 bytes) for setting the file size using SMB1. This changeset
doesn't change the address but makes it slightly clearer.

Addresses-Coverity: 711525 ("Out of bounds write")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-07 11:52:53 -05:00
Filipe Manana ea32af47f0 btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
When syncing the log, if we fail to allocate the root node for the log
root tree:

1) We are unlocking fs_info->tree_log_mutex, but at this point we have
   not yet locked this mutex;

2) We have locked fs_info->tree_root->log_mutex, but we end up not
   unlocking it;

So fix this by unlocking fs_info->tree_root->log_mutex instead of
fs_info->tree_log_mutex.

Fixes: e75f9fd194 ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex")
CC: stable@vger.kernel.org # 5.13+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 18:27:44 +02:00
Johannes Thumshirn 9cc0b837e1 btrfs: don't block if we can't acquire the reclaim lock
If we can't acquire the reclaim_bgs_lock on block group reclaim, we
block until it is free. This can potentially stall for a long time.

While reclaim of block groups is necessary for a good user experience on
a zoned file system, there still is no need to block as it is best
effort only, just like when we're deleting unused block groups.

CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:57 +02:00
Naohiro Aota abb99cfdaf btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck ran afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating too large extents for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
extract_ordered_extent(). Since extract_ordered_extent() uses
create_io_em() to split an existing extent_map, we will have
split->orig_start != split->start. Then, it will be logged with non-zero
"extent data offset". Finally, the logged entries are replayed into
a duplicated EXTENT_ITEM.

Introduce and use proper splitting function for extent_map. The function is
intended to be simple and specific usage for extract_ordered_extent() e.g.
not supporting compression case (we do not allow splitting compressed
extent_map anyway).

There was a question raised by Qu, in summary why we want to split the
extent map (and not the bio):

The problem is not the limit on the zone end, which as you mention is
the same as the block group end. The problem is that data write use zone
append (ZA) operations. ZA BIOs cannot be split so a large extent may
need to be processed with multiple ZA BIOs, While that is also true for
regular writes, the major difference is that ZA are "nameless" write
operation giving back the written sectors on completion. And ZA
operations may be reordered by the block layer (not intentionally
though). Combine both of these characteristics and you can see that the
data for a large extent may end up being shuffled when written resulting
in data corruption and the impossibility to map the extent to some start
sector.

To avoid this problem, zoned btrfs uses the principle "one data extent
== one ZA BIO". So large extents need to be split. This is unfortunate,
but we can revisit this later and optimize, e.g. merge back together the
fragments of an extent once written if they actually were written
sequentially in the zone.

Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:45 +02:00
Filipe Manana 79bd37120b btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:41 +02:00
Filipe Manana 1cb3db1cf3 btrfs: fix deadlock with concurrent chunk allocations involving system chunks
When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but it did not finish yet the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").

However there are two very similar scenarios where this can lead to a
deadlock:

1) Task B allocated a new system chunk and task A is waiting on task B
   to finish creation of the respective system block group. However before
   task B ends its transaction handle and finishes the creation of the
   system block group, it attempts to allocate another chunk (like a data
   chunk for an fallocate operation for a very large range). Task B will
   be unable to progress and allocate the new chunk, because task A set
   space_info->chunk_alloc to 1 and therefore it loops at
   btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
   and set space_info->chunk_alloc to 0, but task A is waiting on task B
   to finish creation of the new system block group, therefore resulting
   in a deadlock;

2) Task B allocated a new system chunk and task A is waiting on task B to
   finish creation of the respective system block group. By the time that
   task B enter the final phase of block group allocation, which happens
   at btrfs_create_pending_block_groups(), when it modifies the extent
   tree, the device tree or the chunk tree to insert the items for some
   new block group, it needs to allocate a new chunk, so it ends up at
   btrfs_chunk_alloc() and keeps looping there because task A has set
   space_info->chunk_alloc to 1, but task A is waiting for task B to
   finish creation of the new system block group and release the reserved
   system space, therefore resulting in a deadlock.

In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.

Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:40 +02:00
Johannes Thumshirn 5f93e776c6 btrfs: zoned: print unusable percentage when reclaiming block groups
When we're automatically reclaiming a zone, because its zone_unusable
value is above the reclaim threshold, we're only logging how much
percent of the zone's capacity are used, but not how much of the
capacity is unusable.

Also print the percentage of the unusable space in the block group
before we're reclaiming it.

Example:

  BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable

CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:37 +02:00
David Sterba 54afaae34e btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
The types in calculation of the used percentage in the reclaiming
messages are both u64, though bg->length is either 1GiB (non-zoned) or
the zone size in the zoned mode. The upper limit on zone size is 8GiB so
this could theoretically overflow in the future, right now the values
fit.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.13
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:34 +02:00
Jaegeuk Kim 28607bf3aa f2fs: drop dirty node pages when cp is in error status
Otherwise, writeback is going to fall in a loop to flush dirty inode forever
before getting SBI_CLOSING.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-06 22:05:06 -07:00
J. Bruce Fields ab1016d39c nfsd: fix NULL dereference in nfs3svc_encode_getaclres
In error cases the dentry may be NULL.

Before 20798dfe24, the encoder also checked dentry and
d_really_is_positive(dentry), but that looks like overkill to me--zero
status should be enough to guarantee a positive dentry.

This isn't the first time we've seen an error-case NULL dereference
hidden in the initialization of a local variable in an xdr encoder.  But
I went back through the other recent rewrites and didn't spot any
similar bugs.

Reported-by: JianHong Yin <jiyin@redhat.com>
Reviewed-by: Chuck Lever III <chuck.lever@oracle.com>
Fixes: 20798dfe24 ("NFSD: Update the NFSv3 GETACL result encoder...")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 7b08cf62b1 NFSD: Prevent a possible oops in the nfs_dirent() tracepoint
The double copy of the string is a mistake, plus __assign_str()
uses strlen(), which is wrong to do on a string that isn't
guaranteed to be NUL-terminated.

Fixes: 6019ce0742 ("NFSD: Add a tracepoint to record directory entry encoding")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Colin Ian King e34c0ce913 nfsd: remove redundant assignment to pointer 'this'
The pointer 'this' is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Trond Myklebust 474bc33469 nfsd: Reduce contention for the nfsd_file nf_rwsem
When flushing out the unstable file writes as part of a COMMIT call, try
to perform most of of the data writes and waits outside the semaphore.

This means that if the client is sending the COMMIT as part of a memory
reclaim operation, then it can continue performing I/O, with contention
for the lock occurring only once the data sync is finished.

Fixes: 5011af4c69 ("nfsd: Fix stable writes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
 Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 0ff5b50ab1 lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 447c14d489 lockd: Update the NLMv4 nlm_res results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 1beef1473c lockd: Update the NLMv4 TEST results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever ec757e423b lockd: Update the NLMv4 void results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 3049e974a7 lockd: Update the NLMv4 FREE_ALL arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever 7cf96b6d01 lockd: Update the NLMv4 SHARE arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:44 -04:00
Chuck Lever bc3665fd71 lockd: Update the NLMv4 SM_NOTIFY arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever b4c24b5a41 lockd: Update the NLMv4 nlm_res arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever d76d8c25ce lockd: Update the NLMv4 UNLOCK arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 1e1f38dcf3 lockd: Update the NLMv4 CANCEL arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 0e5977af4f lockd: Update the NLMv4 LOCK arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 345b4159a0 lockd: Update the NLMv4 TEST arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 7956521aac lockd: Update the NLMv4 void arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 529ca3a116 lockd: Update the NLMv1 SHARE results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever e96735a698 lockd: Update the NLMv1 nlm_res results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever adf98a4850 lockd: Update the NLMv1 TEST results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever e26ec898b6 lockd: Update the NLMv1 void results encoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 14e105256b lockd: Update the NLMv1 FREE_ALL arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:43 -04:00
Chuck Lever 890939e126 lockd: Update the NLMv1 SHARE arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever 137e05e2f7 lockd: Update the NLMv1 SM_NOTIFY arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever 16ddcabe62 lockd: Update the NLMv1 nlm_res arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever c27045d302 lockd: Update the NLMv1 UNLOCK arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever f4e08f3ac8 lockd: Update the NLMv1 CANCEL arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever c1adb8c672 lockd: Update the NLMv1 LOCK arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever 2fd0c67aab lockd: Update the NLMv1 TEST arguments decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever cc1029b512 lockd: Update the NLMv1 void argument decoder to use struct xdr_stream
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever a6a63ca565 lockd: Common NLM XDR helpers
Add a .h file containing xdr_stream-based XDR helpers common to both
NLMv3 and NLMv4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
Chuck Lever a9ad1a8090 lockd: Create a simplified .vs_dispatch method for NLM requests
To enable xdr_stream-based encoding and decoding, create a bespoke
RPC dispatch function for the lockd service.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
J. Bruce Fields 05570a2b01 nfsd: rpc_peeraddr2str needs rcu lock
I'm not even sure cl_xprt can change here, but we're getting "suspicious
RCU usage" warnings, and other rpc_peeraddr2str callers are taking the
rcu lock.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:42 -04:00
ChenXiaoSong 5823e40055 nfs_common: fix doc warning
Fix gcc W=1 warning:

fs/nfs_common/grace.c:91: warning: Function parameter or member 'net' not described in 'locks_in_grace'

Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:41 -04:00
Wei Yongjun 54185267e1 NFSD: Fix error return code in nfsd4_interssc_connect()
'status' has been overwritten to 0 after nfsd4_ssc_setup_dul(), this
cause 0 will be return in vfs_kern_mount() error case. Fix to return
nfserr_nodev in this error.

Fixes: f4e44b3933 ("NFSD: delay unmount source's export after inter-server copy completed.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:41 -04:00
Dai Ngo f47dc2d301 nfsd: fix kernel test robot warning in SSC code
Fix by initializing pointer nfsd4_ssc_umount_item with NULL instead of 0.
Replace return value of nfsd4_ssc_setup_dul with __be32 instead of int.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:41 -04:00
Dave Wysochanski 3518c8666f nfsd4: Expose the callback address and state of each NFS4 client
In addition to the client's address, display the callback channel
state and address in the 'info' file.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:41 -04:00
J. Bruce Fields 934bd07fae nfsd: move fsnotify on client creation outside spinlock
This was causing a "sleeping function called from invalid context"
warning.

I don't think we need the set_and_test_bit() here; clients move from
unconfirmed to confirmed only once, under the client_lock.

The (conf == unconf) is a way to check whether we're in that confirming
case, hopefully that's not too obscure.

Fixes: 472d155a06 "nfsd: report client confirmation status in "info" file"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-07-06 20:14:41 -04:00
Linus Torvalds 8e4f3e1517 fuse update for 5.14
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYORvYQAKCRDh3BK/laaZ
 PCfvAQCbU+PW2RbwlqjZMet6w9qorh29XYe786P5pNRVbMYCygD+N45l66Sbd/Rz
 7M7ioVDseyTW4dnLhb8SzSNB0zr6jQs=
 =MDvD
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fixes for virtiofs submounts

 - Misc fixes and cleanups

* tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  virtiofs: Fix spelling mistakes
  fuse: use DIV_ROUND_UP helper macro for calculations
  fuse: fix illegal access to inode with reused nodeid
  fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)
  fuse: Make fuse_fill_super_submount() static
  fuse: Switch to fc_mount() for submounts
  fuse: Call vfs_get_tree() for submounts
  fuse: add dedicated filesystem context ops for submounts
  virtiofs: propagate sync() to file server
  fuse: reject internal errno
  fuse: check connected before queueing on fpq->io
  fuse: ignore PG_workingset after stealing
  fuse: Fix infinite loop in sget_fc()
  fuse: Fix crash if superblock of submount gets killed early
  fuse: Fix crash in fuse_dentry_automount() error path
2021-07-06 11:17:41 -07:00
Linus Torvalds 729437e334 Orangefs: and adjustment and a fix
The readahead adjustment was suggested by Matthew Wilcox and looks like
 how I should have written it in the first place... the "df fix" was
 suggested by Walt Ligon, some Orangefs users have been complaining
 about whacky df output...
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAmDkWagACgkQz0QOqevO
 Db4q5Q/7BbtqEY8Ht33n72Kat+q2QU/sDmv4TyDeN6B8Ooq/0XhQUxBk4O258+x8
 Yc60nQrap1n1VikScSP29SsfYPgtR4axjzJVwwgT/q4P+6XJsQNAsWwWNzpRWKA3
 eGRfHVErRwLUN15W7vi449II/+AB6iz3sdVZd9eX9duu5g2CRN+yb5zUmaEVmxKc
 u0c6RD1h7E6wuPrshyTkuxFzpdg/Mr5Tz2CPoV31L9XYh4wTRtpwPM1O4036SvAf
 uQfRVGmzX1fNxVdp7h1ikcH67cOVO3HRDW0pA/Gk3GbyAqplCOuLcmO+MHEta/NW
 MIGfsRtcCmfDRGPuYSw85h9ETc2vWlVsPBu4fP5VoFtTHTkZIafWrW4aappIBMzh
 gHye3k6iZyFQnyYfSvhebp3JHJm0CEVHgRfhIsWai4Gtl4mdu2G73YD+m3x2GzxW
 aAk4XthQ0ML0e/VZMnhROn/XS72HT9zm45mxKiF/rks62UCqytTCqIl3df/CRNKf
 TVLTBWLZ459k+RiX8KF8eShbis0ryDbLFcMccGMF4WD9lJpjsmiKSe/PLw5iKRjc
 F4dm6YVhdnK1bph48XlNU+rhMP2Mkpw7i/tf5Nw6Z15F6zoFAs0bnn2nI7kKk9rm
 T/ezJU7rCs50+y2Imxbeb13A+swTFgXdgNuy6mrD6gHCSdHoupU=
 =MLJA
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.14-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "A read-ahead adjustment and a fix.

  The readahead adjustment was suggested by Matthew Wilcox and looks
  like how I should have written it in the first place... the "df fix"
  was suggested by Walt Ligon, some Orangefs users have been complaining
  about whacky df output..."

* tag 'for-linus-5.14-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: fix orangefs df output.
  orangefs: readahead adjustment
2021-07-06 11:12:42 -07:00
Linus Torvalds 7a5e9a17b2 Description for this pull request:
- Improved compatibility issue with exfat from some camera vendors.
 - Do not need to release root inode on error path.
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmDhB1QYHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIf/sQAMtOahaIoZzdF/5Iprk4dP5b
 y8l0C1dM+Ijkc/F0VvfXfdAbRzbXYfd7hh5JyJGXQYdvo6NxjEQ9xSEWPbYxEBJg
 zUc2jKG1Qgmcy7XKAqcFDZjogBFtAqYsDseN0RDiRpUFwrFkT2wE4pC3TLVTK2+b
 GKDxoQdzoXp/BfyHJ03YXxxzwDyBmYK7/7yulLA1nVOiNrRcN7foImylZ+6j24u0
 3IwxqWfbubl5kH9ABn77ZWI1yhwTRHczhIBh4GBDO/6/+HOsvFOoednQ3i/7lx9q
 d2JZqRYyV0Nrxnfy1LAkJ4G4UiZURaLM7UEQUDMFxwqYPAmfW2FiAFx8OvoPoTz6
 ZxYOUSFg/GZOuYSXa5/NF/Gf2qO1sBdFvqnqXn81sJK0L/5BW98YYayCMkdWtvCn
 KFWTi7FCtcotJwsirrYom4rsXx5YfSDCCiLhhwo4j43MepA8krDdLcuBxRRp6LRC
 QZEd7k/gdHvcqxFYYeYM4U5c6m3ASvoT6vJ9kA+P39NwWjrE2wVH1QP3HPVTm1w7
 7P4K/mTrE62VcWEzimtbTtNLfmknPA9eEuzxg7mC9QJdN0VczJ24BXxKRqJg95fd
 yjVNwbsxkwUHXt6GITVGFJkE3E7GIVS9/Ket6VjMZAcSACJjznnL1nnlklu1b6MS
 k40sJJ1o0VLB5JmKQxY/
 =U4e0
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat updates from Namjae Jeon:

 - Improved compatibility issue with exfat from some camera vendors.

 - Do not need to release root inode on error path.

* tag 'exfat-for-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: handle wrong stream entry size in exfat_readdir()
  exfat: avoid incorrectly releasing for root inode
2021-07-06 11:06:04 -07:00
Jaegeuk Kim c9ebd3df43 f2fs: initialize page->private when using for our internal use
We need to guarantee it's initially zero. Otherwise, it'll hurt entire flag
operations.

Fixes: b763f3bedc ("f2fs: restructure f2fs page.private layout")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-05 17:17:30 -07:00
Linus Torvalds f5c13f1fde Driver core changes for 5.14-rc1
Here is the small set of driver core and debugfs updates for 5.14-rc1.
 
 Included in here are:
 	- debugfs api cleanups (touched some drivers)
 	- devres updates
 	- tiny driver core updates and tweaks
 
 Nothing major in here at all, and all have been in linux-next for a
 while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYOM7jA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yloDQCfZOlLYXF+2KgXJQqevNnRiu7/B1gAn3aCX6xh
 UWVUfu5LDIXi2uFERRT1
 =Ze3R
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core changes from Greg KH:
 "Here is the small set of driver core and debugfs updates for 5.14-rc1.

  Included in here are:

   - debugfs api cleanups (touched some drivers)

   - devres updates

   - tiny driver core updates and tweaks

  Nothing major in here at all, and all have been in linux-next for a
  while with no reported issues"

* tag 'driver-core-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
  docs: ABI: testing: sysfs-firmware-memmap: add some memmap types.
  devres: Enable trace events
  devres: No need to call remove_nodes() when there none present
  devres: Use list_for_each_safe_from() in remove_nodes()
  devres: Make locking straight forward in release_nodes()
  kernfs: move revalidate to be near lookup
  drivers/base: Constify static attribute_group structs
  firmware_loader: remove unneeded 'comma' macro
  devcoredump: remove contact information
  driver core: Drop helper devm_platform_ioremap_resource_wc()
  component: Rename 'dev' to 'parent'
  component: Drop 'dev' argument to component_match_realloc()
  device property: Don't check for NULL twice in the loops
  driver core: auxiliary bus: Fix typo in the docs
  drivers/base/node.c: make CACHE_ATTR define static DEVICE_ATTR_RO
  debugfs: remove return value of debugfs_create_ulong()
  debugfs: remove return value of debugfs_create_bool()
  scsi: snic: debugfs: remove local storage of debugfs files
  b43: don't save dentries for debugfs
  b43legacy: don't save dentries for debugfs
  ...
2021-07-05 13:51:41 -07:00
Linus Torvalds eed0218e8c Char / Misc driver updates for 5.14-rc1
Here is the big set of char / misc and other driver subsystem updates
 for 5.14-rc1.  Included in here are:
 	- habanna driver updates
 	- fsl-mc driver updates
 	- comedi driver updates
 	- fpga driver updates
 	- extcon driver updates
 	- interconnect driver updates
 	- mei driver updates
 	- nvmem driver updates
 	- phy driver updates
 	- pnp driver updates
 	- soundwire driver updates
 	- lots of other tiny driver updates for char and misc drivers
 
 This is looking more and more like the "various driver subsystems mushed
 together" tree...
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYOM8jQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymECgCg0yL+8WxDKO5Gg5llM5PshvLB1rQAn0y5pDgg
 nw78LV3HQ0U7qaZBtI91
 =x+AR
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char / misc driver updates from Greg KH:
 "Here is the big set of char / misc and other driver subsystem updates
  for 5.14-rc1. Included in here are:

   - habanalabs driver updates

   - fsl-mc driver updates

   - comedi driver updates

   - fpga driver updates

   - extcon driver updates

   - interconnect driver updates

   - mei driver updates

   - nvmem driver updates

   - phy driver updates

   - pnp driver updates

   - soundwire driver updates

   - lots of other tiny driver updates for char and misc drivers

  This is looking more and more like the "various driver subsystems
  mushed together" tree...

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (292 commits)
  mcb: Use DEFINE_RES_MEM() helper macro and fix the end address
  PNP: moved EXPORT_SYMBOL so that it immediately followed its function/variable
  bus: mhi: pci-generic: Add missing 'pci_disable_pcie_error_reporting()' calls
  bus: mhi: Wait for M2 state during system resume
  bus: mhi: core: Fix power down latency
  intel_th: Wait until port is in reset before programming it
  intel_th: msu: Make contiguous buffers uncached
  intel_th: Remove an unused exit point from intel_th_remove()
  stm class: Spelling fix
  nitro_enclaves: Set Bus Master for the NE PCI device
  misc: ibmasm: Modify matricies to matrices
  misc: vmw_vmci: return the correct errno code
  siox: Simplify error handling via dev_err_probe()
  fpga: machxo2-spi: Address warning about unused variable
  lkdtm/heap: Add init_on_alloc tests
  selftests/lkdtm: Enable various testable CONFIGs
  lkdtm: Add CONFIG hints in errors where possible
  lkdtm: Enable DOUBLE_FAULT on all architectures
  lkdtm/heap: Add vmalloc linear overflow test
  lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE
  ...
2021-07-05 13:42:16 -07:00
Steve French 90810c25cf smb3: fix typo in header file
Although it compiles, the test robot correctly noted:
  'cifsacl.h' file not found with <angled> include; use "quotes" instead

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-05 10:59:53 -05:00
Namjae Jeon 1e5654de0f exfat: handle wrong stream entry size in exfat_readdir()
The compatibility issue between linux exfat and exfat of some camera
company was reported from Florian. In their exfat, if the number of files
exceeds any limit, the DataLength in stream entry of the directory is
no longer updated. So some files created from camera does not show in
linux exfat. because linux exfat doesn't allow that cpos becomes larger
than DataLength of stream entry. This patch check DataLength in stream
entry only if the type is ALLOC_NO_FAT_CHAIN and add the check ensure
that dentry offset does not exceed max dentries size(256 MB) to avoid
the circular FAT chain issue.

Fixes: ca06197382 ("exfat: add directory operations")
Cc: stable@vger.kernel.org # v5.9
Reported-by: Florian Cramer <flrncrmr@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Tested-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2021-07-04 09:33:00 +09:00
Linus Torvalds 58ec9059b3 Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs name lookup updates from Al Viro:
 "Small namei.c patch series, mostly to simplify the rules for nameidata
  state. It's actually from the previous cycle - but I didn't post it
  for review in time...

  Changes visible outside of fs/namei.c: file_open_root() calling
  conventions change, some freed bits in LOOKUP_... space"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: make sure nd->depth is always valid
  teach set_nameidata() to handle setting the root as well
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
  switch file_open_root() to struct path
2021-07-03 11:41:14 -07:00
Linus Torvalds d3acb15a3a Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "iov_iter cleanups and fixes.

  There are followups, but this is what had sat in -next this cycle. IMO
  the macro forest in there became much thinner and easier to follow..."

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
  clean up copy_mc_pipe_to_iter()
  pipe_zero(): we don't need no stinkin' kmap_atomic()...
  iov_iter: clean csum_and_copy_...() primitives up a bit
  copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
  copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
  iterate_xarray(): only of the first iteration we might get offset != 0
  pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
  iov_iter: make iterator callbacks use base and len instead of iovec
  iov_iter: make the amount already copied available to iterator callbacks
  iov_iter: get rid of separate bvec and xarray callbacks
  iov_iter: teach iterate_{bvec,xarray}() about possible short copies
  iterate_bvec(): expand bvec.h macro forest, massage a bit
  iov_iter: unify iterate_iovec and iterate_kvec
  iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
  iterate_and_advance(): get rid of magic in case when n is 0
  csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
  iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
  [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
  iov_iter_npages(): don't bother with iterate_all_kinds()
  ...
2021-07-03 11:30:04 -07:00
Linus Torvalds f92a322a63 Merge branch 'work.d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_path() updates from Al Viro:
 "d_path.c refactoring"

* 'work.d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  getcwd(2): clean up error handling
  d_path: prepend_path() is unlikely to return non-zero
  d_path: prepend_path(): lift the inner loop into a new helper
  d_path: prepend_path(): lift resetting b in case when we'd return 3 out of loop
  d_path: prepend_path(): get rid of vfsmnt
  d_path: introduce struct prepend_buffer
  d_path: make prepend_name() boolean
  d_path: lift -ENAMETOOLONG handling into callers of prepend_path()
  d_path: don't bother with return value of prepend()
  getcwd(2): saner logics around prepend_path() call
  d_path: get rid of path_with_deleted()
  d_path: regularize handling of root dentry in __dentry_path()
  d_path: saner calling conventions for __dentry_path()
  d_path: "\0" is {0,0}, not {0}
2021-07-03 11:23:40 -07:00
Linus Torvalds 757fa80f4e Tracing updates for 5.14:
- Added option for per CPU threads to the hwlat tracer
 
  - Have hwlat tracer handle hotplug CPUs
 
  - New tracer: osnoise, that detects latency caused by interrupts, softirqs
    and scheduling of other tasks.
 
  - Added timerlat tracer that creates a thread and measures in detail what
    sources of latency it has for wake ups.
 
  - Removed the "success" field of the sched_wakeup trace event.
    This has been hardcoded as "1" since 2015, no tooling should be looking
    at it now. If one exists, we can revert this commit, fix that tool and
    try to remove it again in the future.
 
  - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
 
  - New boot command line option "tp_printk_stop", as tp_printk causes trace
    events to write to console. When user space starts, this can easily live
    lock the system. Having a boot option to stop just after boot up is
    useful to prevent that from happening.
 
  - Have ftrace_dump_on_oops boot command line option take numbers that match
    the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
 
  - Bootconfig clean ups, fixes and enhancements.
 
  - New ktest script that tests bootconfig options.
 
  - Add tracepoint_probe_register_may_exist() to register a tracepoint
    without triggering a WARN*() if it already exists. BPF has a path from
    user space that can do this. All other paths are considered a bug.
 
  - Small clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/
 yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ=
 =WEPR
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Added option for per CPU threads to the hwlat tracer

 - Have hwlat tracer handle hotplug CPUs

 - New tracer: osnoise, that detects latency caused by interrupts,
   softirqs and scheduling of other tasks.

 - Added timerlat tracer that creates a thread and measures in detail
   what sources of latency it has for wake ups.

 - Removed the "success" field of the sched_wakeup trace event. This has
   been hardcoded as "1" since 2015, no tooling should be looking at it
   now. If one exists, we can revert this commit, fix that tool and try
   to remove it again in the future.

 - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.

 - New boot command line option "tp_printk_stop", as tp_printk causes
   trace events to write to console. When user space starts, this can
   easily live lock the system. Having a boot option to stop just after
   boot up is useful to prevent that from happening.

 - Have ftrace_dump_on_oops boot command line option take numbers that
   match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.

 - Bootconfig clean ups, fixes and enhancements.

 - New ktest script that tests bootconfig options.

 - Add tracepoint_probe_register_may_exist() to register a tracepoint
   without triggering a WARN*() if it already exists. BPF has a path
   from user space that can do this. All other paths are considered a
   bug.

 - Small clean ups and fixes

* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
  tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
  tracing: Simplify & fix saved_tgids logic
  treewide: Add missing semicolons to __assign_str uses
  tracing: Change variable type as bool for clean-up
  trace/timerlat: Fix indentation on timerlat_main()
  trace/osnoise: Make 'noise' variable s64 in run_osnoise()
  tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
  tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
  Documentation: Fix a typo on trace/osnoise-tracer
  trace/osnoise: Fix return value on osnoise_init_hotplug_support
  trace/osnoise: Make interval u64 on osnoise_main
  trace/osnoise: Fix 'no previous prototype' warnings
  tracing: Have osnoise_main() add a quiescent state for task rcu
  seq_buf: Make trace_seq_putmem_hex() support data longer than 8
  seq_buf: Fix overflow in seq_buf_putmem_hex()
  trace/osnoise: Support hotplug operations
  trace/hwlat: Support hotplug operations
  trace/hwlat: Protect kdata->kthread with get/put_online_cpus
  trace: Add timerlat tracer
  trace: Add osnoise tracer
  ...
2021-07-03 11:13:22 -07:00
Steve French b019e1187c CIFS: Clarify SMB1 code for UnixSetPathInfo
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for doing SetPathInfo (setattr) when using the Unix
extensions.  This doesn't change the address but makes it
slightly clearer.

Addresses-Coverity: 711528 ("Out of bounds read")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-02 18:36:26 -05:00
Steve French ded2d99cef CIFS: Clarify SMB1 code for UnixCreateSymLink
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for creating SMB1 symlinks when using the Unix
extensions.  This doesn't change the address but
makes it slightly clearer.

Addresses-Coverity: 711530 ("Out of bounds read")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-02 18:36:24 -05:00
Steve French 819f916c83 cifs: clarify SMB1 code for UnixCreateHardLink
Coverity complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes).  This doesn't change the address but
makes it slightly clearer.

Addresses-Coverity: 711529 ("Out of bounds read")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-02 18:36:23 -05:00
Steve French 01cf30825c cifs: make locking consistent around the server session status
There were three places where we were not taking the spinlock
around updates to server->tcpStatus when it was being modified.
To be consistent (also removes Coverity warning) and to remove
possibility of race best to lock all places where it is updated.
Two of the three were in initialization of the field and can't
race - but added lock around the other.

Addresses-Coverity: 1399512 ("Data race condition")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-07-02 18:35:25 -05:00
Linus Torvalds 9f7b640f00 New code for 5.14:
- Refactor the buffer cache to use bulk page allocation
 - Convert agnumber-based AG iteration to walk per-AG structures
 - Clean up some unit conversions and other code warts
 - Reduce spinlock contention in the directio fastpath
 - Collapse all the inode cache walks into a single function
 - Remove indirect function calls from the inode cache walk code
 - Dramatically reduce the number of cache flushes sent when writing log
   buffers
 - Preserve inode sickness reports for longer
 - Rename xfs_eofblocks since it controls inode cache walks
 - Refactor the extended attribute code to prepare it for the addition
   of log intent items to make xattrs fully transactional
 - A few fixes to earlier large patchsets
 - Log recovery fixes so that we don't accidentally mark the log clean
   when log intent recovery fails
 - Fix some latent SOB errors
 - Clean up shutdown messages that get logged to dmesg
 - Fix a regression in the online shrink code
 - Fix a UAF in the buffer logging code if the fs goes offline
 - Fix uninitialized error variables
 - Fix a UAF in the CIL when commited log item callbacks race with a
   shutdown
 - Fix a bug where the CIL could hang trying to push part of the log ring
   buffer that hasn't been filled yet
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDXP38ACgkQ+H93GTRK
 tOsKzw//eHvEgeyBo7ek06GDsUph2kQVR9AJWE7MNMiBFxlmL8R9H225xJK7Qmcr
 YswcyEeDq8cNXbXDA249ueuMb+DxhZPY68hPK5BJ3KsbvL2RZV0lJCbk492l4cgb
 IvBJiG/MDo55km83tdr81AlmFYQM7rSQz5MbVogGxxsnp0ul3VpIrJZba8kPRDQ1
 mZzH2fdlnE9Ozw/CfvjSgT1pySyFpxNeTRucYXUQil1hL1AGTBw7rGGNnccS090y
 u/EawQ4WJ131m8O3+WomUmaGyZFlWvTpHzukKxvrEvZ6AG+HpIhMcbZ5J6nkRTY4
 xxhUBG2qNKIcgPmPwAGmx1cylcsOCNKQgp+fko9tAZjEkgT5cbCpqpjGgjNB0RCf
 pB0PY6idCFl9hmBpVgMWz2AZ9IsDmK54qufmLtzq/zN8cThzt6A95UUR0rGu5Kd8
 CUmmdQTYl0GqlTTszCO2rw1+zRtcasMpBVmeYHDxy00bd1dHLUJ6o8DuXRYTTQti
 J/6CZVVD56jieRb+uvrOq4mhiPR2kynciiu1dXdY5kx79kKom6HMBBvtTl8b9kmh
 smWihfip7BTpz5vFzcwFmMxFwzW3K4LnDZl7qEGqXDEIHOL+pRWazU2yN3JZRGyd
 z4SQMJuER0HTTA0yO09c3/CX9onorhjUIMgQ9U25l1hdyFna0+o=
 =08Q9
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Most of the work this cycle has been on refactoring various parts of
  the codebase. The biggest non-cleanup changes are (1) reducing the
  number of cache flushes sent when writing the log; (2) a substantial
  number of log recovery fixes; and (3) I started accepting pull
  requests from contributors if the commits in their branches match
  what's been sent to the list.

  For a week or so I /had/ staged a major cleanup of the logging code
  from Dave Chinner, but it exposed so many lurking bugs in other parts
  of the logging and log recovery code that I decided to defer that
  patchset until we can address those latent bugs.

  Larger cleanups this time include walking the incore inode cache (me)
  and rework of the extended attribute code (Allison) to prepare it for
  adding logged xattr updates (and directory tree parent pointers) in
  future releases.

  Summary:

   - Refactor the buffer cache to use bulk page allocation

   - Convert agnumber-based AG iteration to walk per-AG structures

   - Clean up some unit conversions and other code warts

   - Reduce spinlock contention in the directio fastpath

   - Collapse all the inode cache walks into a single function

   - Remove indirect function calls from the inode cache walk code

   - Dramatically reduce the number of cache flushes sent when writing
     log buffers

   - Preserve inode sickness reports for longer

   - Rename xfs_eofblocks since it controls inode cache walks

   - Refactor the extended attribute code to prepare it for the addition
     of log intent items to make xattrs fully transactional

   - A few fixes to earlier large patchsets

   - Log recovery fixes so that we don't accidentally mark the log clean
     when log intent recovery fails

   - Fix some latent SOB errors

   - Clean up shutdown messages that get logged to dmesg

   - Fix a regression in the online shrink code

   - Fix a UAF in the buffer logging code if the fs goes offline

   - Fix uninitialized error variables

   - Fix a UAF in the CIL when commited log item callbacks race with a
     shutdown

   - Fix a bug where the CIL could hang trying to push part of the log
     ring buffer that hasn't been filled yet"

* tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (102 commits)
  xfs: don't wait on future iclogs when pushing the CIL
  xfs: Fix a CIL UAF by getting get rid of the iclog callback lock
  xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
  xfs: don't nest icloglock inside ic_callback_lock
  xfs: Initialize error in xfs_attr_remove_iter
  xfs: fix endianness issue in xfs_ag_shrink_space
  xfs: remove dead stale buf unpin handling code
  xfs: hold buffer across unpin and potential shutdown processing
  xfs: force the log offline when log intent item recovery fails
  xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes
  xfs: shorten the shutdown messages to a single line
  xfs: print name of function causing fs shutdown instead of hex pointer
  xfs: fix type mismatches in the inode reclaim functions
  xfs: separate primary inode selection criteria in xfs_iget_cache_hit
  xfs: refactor the inode recycling code
  xfs: add iclog state trace events
  xfs: xfs_log_force_lsn isn't passed a LSN
  xfs: Fix CIL throttle hang when CIL space used going backwards
  xfs: journal IO cache flush reductions
  xfs: remove need_start_rec parameter from xlog_write()
  ...
2021-07-02 14:30:27 -07:00
Linus Torvalds e4aa67b023 JFS fixes for 5.14
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmDeGXwACgkQNqiEXrVA
 jGTONw/+J9mdqQW4zrrIotoO49wKpPv0Ub/XJqT+cDrae91w/lrOct2g0U12aG/e
 zkVeGDhWF4iqsVhyAPx1U6Dh0/rg3JlCSr3l2NYk7WKGwQSOFINtrXPMUfz1P9h1
 M/Cvz9XIkIzATeEd6yrGdQyw5bcSQu3uF8o71ufdKJfeNkAxlhigw690vePMXPFp
 YSmBgLVM1ntp2apJyJiFYVNJqfbDL2bVWG3a+ENGXjD+b/8IsShTKM6VyEXRWGKa
 M1RjEW7i5kvoPtmbV5tLkySete+Bumv8ZRGoAv306tgABjWRkAZN31EzdZfBe5Io
 8pK5USGaSVauP9E0wXV4irfaAuYPp1uAsZqzhxbRXLtTsLKJKrZKxjq9kZON3FPd
 tGY1v462zBrj79oN++Z7/bDxwbcfJ+OPid2Fev6SlsYfTndTp1cseLDsef574/vo
 1P6DgCHNabHWZcjeI2ubMIM+W3+pAwjWUeeLOedCa7IZ6VKTwx+8RPEDNAxMq7lO
 6fs+NUtbEd/veaTmicJnllvld1HcqS+H/sPSS7tHhTtoMTVG2bdpzyvvxGzKOxq+
 IQNnNOLwg6LsH3z+uenU+fhnm5Z/b+iZsMG3LgXALu6JCDaZf/ygvyQfhhJS5n5r
 MJfWP2ZSpEf3yYcsoKvgFafVXv6p2fnibfB/2ecmyQ6Lz54U6hc=
 =li00
 -----END PGP SIGNATURE-----

Merge tag 'jfs-5.14' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
 "JFS fixes for 5.14"

* tag 'jfs-5.14' of git://github.com/kleikamp/linux-shaggy:
  jfs: Avoid field-overflowing memcpy()
  jfs: remove unnecessary oom message
  jfs: fix GPF in diFree
  fs/jfs: Fix missing error code in lmLogInit()
  jfs: Remove trailing semicolon in macros
  fs: Fix typo issue
2021-07-02 14:25:17 -07:00
Linus Torvalds ced4cca754 configfs updates for Linux 5.13
- fix a memleak in configfs_release_bin_file (Chung-Chiang Cheng)
  - implement the .read_iter and .write_iter (Bart Van Assche)
  - minor cleanups (Bart Van Assche, me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmDfOe8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNbuA//dTCwt9U8+XIDeWatRJY2BFSTorNhaeayYiGDDnVo
 7NVN/vw5qtt7UHo7hFBi918gh9IGSIbgyYZIZAV4/qCqnveNMnD/Zuwd8j0FHzu4
 OvmjORj52cpzXiyJ87vJ+fGVjqe9/Vtrzr4IFyhJZQnz+c1L0NOgReAiLJtE+1C/
 8RvecwCvIxZVjw8SRjOZNCd/DFBx6nPbwsh0ZLkAiL6TQkV5erOUmFyoOAiXRQwB
 H+6fcrjGteoXXciGmvRtfIaJEO5o2yE4pXEsoYv2aDBF5dtpq/wti8CV7bWSIXiV
 HI90UrVDO+bK94DKC0JmYFEAxi7HmX073OWu0xBgXjpUnc1kepJ+9DJ74Y43QwwF
 C0eHKu4b/nP/50c/7UxqUSmS23c9YiJawkYZjxbnxklN9QhRs8v61htOO6i9CFnu
 KjePcLs07sn6AEa16taCV/+iwfgyul047jegrZZSTNioBPgpdMX+j2fzt2FTWioJ
 zfVLsiBqvKDKXEkulsTMMFmO4Un2kcX7cTX7v9DclGvpCuEoOpPUDERS5H8+DkqV
 q+O2HPohFdFItcWge58eCvZ0OJ43Px07Mpn6WH48VLCDLTg7H7Z/85yEO6Wg+a6V
 JlgwdoAlHo5BHfS2w7PtkblbaJH9MQRc6dvSuv/sE1x8AouWze7wkfVb0ZH0X1GF
 Xjo=
 =dquw
 -----END PGP SIGNATURE-----

Merge tag 'configfs-5.13' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:

 - fix a memleak in configfs_release_bin_file (Chung-Chiang Cheng)

 - implement the .read_iter and .write_iter (Bart Van Assche)

 - minor cleanups (Bart Van Assche, me)

* tag 'configfs-5.13' of git://git.infradead.org/users/hch/configfs:
  configfs: simplify configfs_release_bin_file
  configfs: fix memleak in configfs_release_bin_file
  configfs: implement the .read_iter and .write_iter methods
  configfs: drop pointless kerneldoc comments
  configfs: fix the kerneldoc comment for configfs_create_bin_file
2021-07-02 14:13:21 -07:00
Linus Torvalds 71bd934101 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...
2021-07-02 12:08:10 -07:00
Fengnan Chang 151b1982be f2fs: compress: add nocompress extensions support
When we create a directory with enable compression, all file write into
directory will try to compress.But sometimes we may know, new file
cannot meet compression ratio requirements.
We need a nocompress extension to skip those files to avoid unnecessary
compress page test.

After add nocompress_extension, the priority should be:
dir_flag < comp_extention,nocompress_extension < comp_file_flag,
no_comp_file_flag.

Priority in between FS_COMPR_FL, FS_NOCOMP_FS, extensions:
   * compress_extension=so; nocompress_extension=zip; chattr +c dir;
     touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
     and baz.txt should be compresse, bar.zip should be non-compressed.
     chattr +c dir/bar.zip can enable compress on bar.zip.
   * compress_extension=so; nocompress_extension=zip; chattr -c dir;
     touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
     should be compresse, bar.zip and baz.txt should be non-compressed.
     chattr+c dir/bar.zip; chattr+c dir/baz.txt; can enable compress on
     bar.zip and baz.txt.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-01 18:34:31 -07:00
Stephen Brennan cd84bbbac1 ext4: use ext4_grp_locked_error in mb_find_extent
Commit 5d1b1b3f49 ("ext4: fix BUG when calling ext4_error with locked
block group") introduces ext4_grp_locked_error to handle unlocking a
group in error cases. Otherwise, there is a possibility of a sleep while
atomic. However, since 43c73221b3 ("ext4: replace BUG_ON with WARN_ON
in mb_find_extent()"), mb_find_extent() has contained a ext4_error()
call while a group spinlock is held. Replace this with
ext4_grp_locked_error.

Fixes: 43c73221b3 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-07-01 17:14:07 -04:00
Pavel Begunkov e09ee51060 io_uring: fix exiting io_req_task_work_add leaks
If one entered io_req_task_work_add() not seeing PF_EXITING, it will set
a ->task_state bit and try task_work_add(), which may fail by that
moment. If that happens the function would try to cancel the request.

However, in a meanwhile there might come other io_req_task_work_add()
callers, which will see the bit set and leave their requests in the
list, which will never be executed.

Don't propagate an error, but clear the bit first and then fallback
all requests that we can splice from the list. The callback functions
have to be able to deal with PF_EXITING, so poll and apoll was modified
via changing io_poll_rewait().

Fixes: 7cbf1722d5 ("io_uring: provide FIFO ordering for task_work")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060002f19f1fdbd130ba24aef818ea4d3080819b.1625142209.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:32 -06:00
Pavel Begunkov 5b0a6acc73 io_uring: simplify task_work func
Since we don't really use req->task_work anymore, get rid of it together
with the nasty ->func aliasing between ->io_task_work and ->task_work,
and hide ->fallback_node inside of io_task_work.

Also, as task_work is gone now, replace the callback type from
task_work_func_t to a function taking io_kiocb to avoid casting and
simplify code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:23 -06:00
Pavel Begunkov 9011bf9a13 io_uring: fix stuck fallback reqs
When task_work_add() fails, we use ->exit_task_work to queue the work.
That will be run only in the cancellation path, which happens either
when the ctx is dying or one of tasks with inflight requests is exiting
or executing. There is a good chance that such a request would just get
stuck in the list potentially hodling a file, all io_uring rsrc
recycling or some other resources. Nothing terrible, it'll go away at
some point, but we don't want to lock them up for longer than needed.

Replace that hand made ->exit_task_work with delayed_work + llist
inspired by fput_many().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:17 -06:00
Linus Torvalds c288d9cd71 for-5.14/io_uring-2021-06-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDc0SsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt6BD/9ZHzfDCPVfd0ivftRYOS2GECZetR79Ar8h
 RdMPi4iSw1+gFbqKVpD38tBvJsvKq0/jVPap056KaxbdsU+JCG1dYfllbsK4tpVO
 uqDHDIX3YODmkRkqc6IkPtdxvATS3JBGdP9GOSEobgygOPs9qBXXd4tdP+2Hyzna
 0DUIGnuO3VTfBa03OiHoIeC9Zo3o9sCIYHY1RG+awL/+43WeMQ8ZFX4CfYhgWTTp
 HyzzHdu67xAoVgsIyg5dTuV+Vo3cEtHXg8IKbDXoDQ1SAyr96Uw5pJ7JEq80mqyZ
 sxth6PlS0Safh6xJ5w6ZLIAaPz1fkL+2q1CeQAhkeWLuk6FImt0ZoxElbSkW+pTv
 LqVcXl+F65QhNQZ195OgPI+RGTSA5VQlAW0QwGnmEeket7pzqkdcLPVy1CO8ANiT
 t5qn8ZZmNFLMhnWGU/3zNyHQLvwKslxucpvrkS9Ol/6Yh9yKVEsAXC7yrk3F3dAF
 i4RAExbbUzUJKvRkm3kXpwUg2BGtJD7Achw9HD+prFKdt08XDLvffC8GnCqWd/4a
 zL9oSHR2OLV7RAQpDQ5cC7ZjWV20xsd2NM+pHk5rzbb9udQZWrJ2K218sqKkgFme
 MJzYIt/a79KvBh9+TvJBhx1adnmE4gBH4X23ZMN1hMXzoyVY5mELHbe5LwBSkibx
 A1DxPKv4vg==
 =FEZd
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Multi-queue iopoll improvement (Fam)

 - Allow configurable io-wq CPU masks (me)

 - renameat/linkat tightening (me)

 - poll re-arm improvement (Olivier)

 - SQPOLL race fix (Olivier)

 - Cancelation unification (Pavel)

 - SQPOLL cleanups (Pavel)

 - Enable file backed buffers for shmem/memfd (Pavel)

 - A ton of cleanups and performance improvements (Pavel)

 - Followup and misc fixes (Colin, Fam, Hao, Olivier)

* tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block: (83 commits)
  io_uring: code clean for kiocb_done()
  io_uring: spin in iopoll() only when reqs are in a single queue
  io_uring: pre-initialise some of req fields
  io_uring: refactor io_submit_flush_completions
  io_uring: optimise hot path restricted checks
  io_uring: remove not needed PF_EXITING check
  io_uring: mainstream sqpoll task_work running
  io_uring: refactor io_arm_poll_handler()
  io_uring: reduce latency by reissueing the operation
  io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
  io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
  io_uring: refactor io_openat2()
  io_uring: simplify struct io_uring_sqe layout
  io_uring: update sqe layout build checks
  io_uring: fix code style problems
  io_uring: refactor io_sq_thread()
  io_uring: don't change sqpoll creds if not needed
  io_uring: Create define to modify a SQPOLL parameter
  io_uring: Fix race condition when sqp thread goes to sleep
  io_uring: improve in tctx_task_work() resubmission
  ...
2021-07-01 12:16:24 -07:00
Linus Torvalds 911a2997a5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k
 QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2
 Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ
 P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru
 62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq
 SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu
 +u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw==
 =tn0h
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs updates from Jan Kara:
 "The new quotactl_fd() syscall (remake of quotactl_path() syscall that
  got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
  isofs, and writeback fixes and cleanups"

* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: fix obtain a reference to a freeing memcg css
  quota: remove unnecessary oom message
  isofs: remove redundant continue statement
  quota: Wire up quotactl_fd syscall
  quota: Change quotactl_path() systcall to an fd-based one
  reiserfs: Remove unneed check in reiserfs_write_full_page()
  udf: Fix NULL pointer dereference in udf_symlink function
  reiserfs: add check for invalid 1st journal block
2021-07-01 12:06:39 -07:00
Alexey Dobriyan bae7702a17 exec: remove checks in __register_bimfmt()
Delete NULL check, all callers pass valid pointer.

Delete ->load_binary check -- failure to provide hook in a custom module
will be very noticeable at the very first execve call.

Link: https://lkml.kernel.org/r/YK1Gy1qXaLAR+tPl@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:06 -07:00
Chung-Chiang Cheng c3eb84092b hfsplus: report create_date to kstat.btime
The create_date field of inode in hfsplus is corresponding to
kstat.btime and could be reported in statx.

Link: https://lkml.kernel.org/r/20210416172147.8736-1-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:06 -07:00
Zhen Lei 7dcae11f4c hfsplus: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Link: https://lkml.kernel.org/r/20210617084944.1279-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:06 -07:00
Colin Ian King f4048e5aa1 nilfs2: remove redundant continue statement in a while-loop
The continue statement at the end of the while-loop is redundant,
remove it.

Addresses-Coverity: ("Continue has no effect")
Link: https://lkml.kernel.org/r/20210621100519.10257-1-colin.king@canonical.com
Link: https://lkml.kernel.org/r/1624557664-17159-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:06 -07:00
Andy Shevchenko cc72181a65 seq_file: drop unused *_escape_mem_ascii()
There are no more users of the seq_escape_mem_ascii() followed by
string_escape_mem_ascii().

Remove them for good.

Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:05 -07:00
Andy Shevchenko c0546391c2 nfsd: avoid non-flexible API in seq_quote_mem()
The seq_escape_mem_ascii() is completely non-flexible and shouldn't be
used.  Replace it with properly called seq_escape_mem().

Link: https://lkml.kernel.org/r/20210504180819.73127-15-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:05 -07:00
Andy Shevchenko fc3de02eae seq_file: convert seq_escape() to use seq_escape_str()
Convert seq_escape() to use seq_escape_str() rather than open coding it.

Note, for now we leave it as an exported symbol due to some old code that
can't tolerate ctype.h being (indirectly) included.

Link: https://lkml.kernel.org/r/20210504180819.73127-14-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:05 -07:00
Andy Shevchenko 1d31aa172a seq_file: introduce seq_escape_mem()
Introduce seq_escape_mem() to allow users to pass additional parameters to
string_escape_mem().

Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:05 -07:00
Kalesh Singh 3845f256a8 procfs/dmabuf: add inode number to /proc/*/fdinfo
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.

The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.

Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Kalesh Singh 7bc3fa0172 procfs: allow reading fdinfo with PTRACE_MODE_READ
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers.  In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting.  Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.

Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:

  1. Do a readlink on each FD.
  2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
  3. stat the file to get the dmabuf inode number.
  4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.

Accessing other processes' fdinfo requires root privileges.  This limits
the use of the interface to debugging environments and is not suitable for
production builds.  Granting root privileges even to a system process
increases the attack surface and is highly undesirable.

Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.

Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Marcelo Henrique Cerri d238692b4b proc: Avoid mixing integer types in mem_rw()
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.

Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.

Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Souza Cascardo <cascardo@canonical.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:04 -07:00
Alistair Popple af5cdaf822 mm: remove special swap entry functions
Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory.  To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU.  This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes.  For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn.  Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 11:06:03 -07:00
Ye Bin 558d6450c7 ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock
If a writeback of the superblock fails with an I/O error, the buffer
is marked not uptodate.  However, this can cause a WARN_ON to trigger
when we attempt to write superblock a second time.  (Which might
succeed this time, for cerrtain types of block devices such as iSCSI
devices over a flaky network.)

Try to detect this case in flush_stashed_error_work(), and also change
__ext4_handle_dirty_metadata() so we always set the uptodate flag, not
just in the nojournal case.

Before this commit, this problem can be repliciated via:

1. dmsetup  create dust1 --table  '0 2097152 dust /dev/sdc 0 4096'
2. mount  /dev/mapper/dust1  /home/test
3. dmsetup message dust1 0 addbadblock 0 10
4. cd /home/test
5. echo "XXXXXXX" > t

After a few seconds, we got following warning:

[   80.654487] end_buffer_async_write: bh=0xffff88842f18bdd0
[   80.656134] Buffer I/O error on dev dm-0, logical block 0, lost async page write
[   85.774450] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:193: comm kworker/u16:8: Error while async write back metadata
[   91.415513] mark_buffer_dirty: bh=0xffff88842f18bdd0
[   91.417038] ------------[ cut here ]------------
[   91.418450] WARNING: CPU: 1 PID: 1944 at fs/buffer.c:1092 mark_buffer_dirty.cold+0x1c/0x5e
[   91.440322] Call Trace:
[   91.440652]  __jbd2_journal_temp_unlink_buffer+0x135/0x220
[   91.441354]  __jbd2_journal_unfile_buffer+0x24/0x90
[   91.441981]  __jbd2_journal_refile_buffer+0x134/0x1d0
[   91.442628]  jbd2_journal_commit_transaction+0x249a/0x3240
[   91.443336]  ? put_prev_entity+0x2a/0x200
[   91.443856]  ? kjournald2+0x12e/0x510
[   91.444324]  kjournald2+0x12e/0x510
[   91.444773]  ? woken_wake_function+0x30/0x30
[   91.445326]  kthread+0x150/0x1b0
[   91.445739]  ? commit_timeout+0x20/0x20
[   91.446258]  ? kthread_flush_worker+0xb0/0xb0
[   91.446818]  ret_from_fork+0x1f/0x30
[   91.447293] ---[ end trace 66f0b6bf3d1abade ]---

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210615090537.3423231-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-07-01 10:58:04 -04:00
Collin Fijalkovich eb6ecbed0a mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE.  This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open().  Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g.  Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()).  It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout.  The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping
to be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:29 -07:00
David Hildenbrand c6d9eee2a6 fs/proc/kcore: use page_offline_(freeze|thaw)
Let's properly synchronize with drivers that set PageOffline().
Unfreeze/thaw every now and then, so drivers that want to set
PageOffline() can make progress.

Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:28 -07:00
David Hildenbrand 0daa322b8f fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is
   stale as the memory is effectively unused by the kernel.  On s390x with
   standby memory, offline memory sections (belonging to offline storage
   increments) are not accessible.  With virtio-mem and the hyper-v
   balloon, we can have unavailable memory chunks that should not be
   accessed inside offline memory sections.  Last but not least, offline
   memory sections might contain hwpoisoned pages which we can no longer
   identify because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as
   "The content of these pages is effectively stale.  Such pages should
   not be touched (read/write/dump/save) except by their owner.".
   Examples include pages inflated in a balloon or unavailble memory
   ranges inside hotplugged memory sections with virtio-mem or the hyper-v
   balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
   As documented: "Accessing is not safe since it may cause another
   machine check.  Don't touch!"

Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but best we can really do.

Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page()
yet.

Note 1: we can race against memory offlining code, especially memory going
offline and getting unplugged: however, we will properly tear down the
identity mapping and handle faults gracefully when accessing this memory
from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor.  We'll handle this in a follow-up
patch.

Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:28 -07:00
David Hildenbrand 2711032c64 fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.

Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:28 -07:00
David Hildenbrand 3c36b419b1 fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.

Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.

Examples for PageOffline() pages are pages inflated in a balloon, memory
unplugged via virtio-mem, and partially-present sections in memory added
by the Hyper-V balloon.

When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages.  In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.

With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem.  kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.

Tested via kcore_dump:
	https://github.com/schlafwandler/kcore_dump

This patch (of 6):

Commit db779ef67f ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.

Commit 595dd46ebf ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.

Let's drop both types.  While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.

Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:28 -07:00
Axel Rasmussen 964ab0040f userfaultfd/shmem: advertise shmem minor fault support
Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.

Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Axel Rasmussen c949b097ef userfaultfd/shmem: support minor fault registration for shmem
This patch allows shmem-backed VMAs to be registered for minor faults.
Minor faults are appropriately relayed to userspace in the fault path, for
VMAs with the relevant flag.

This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
minor faults, though, so userspace doesn't yet have a way to resolve such
faults.

Because of this, we also don't yet advertise this as a supported feature.
That will be done in a separate commit when the feature is fully
implemented.

Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Peter Xu fb8e37f35a mm/pagemap: export uffd-wp protection information
Export the PTE/PMD status of uffd-wp to pagemap too.

Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Peter Xu 00b151f21f mm/userfaultfd: fail uffd-wp registration if not supported
We should fail uffd-wp registration immediately if the arch does not even
have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined.  That'll block also relevant
ioctls on e.g.  UFFDIO_WRITEPROTECT because that'll check against
VM_UFFD_WP, which can only be applied with a success registration.

Remove the WP feature bit too for those archs when handling UFFDIO_API
ioctl.

Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:27 -07:00
Muchun Song e6d41f12df mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap pages
associated with each HugeTLB page is default off.  Now the vmemmap is PMD
mapped.  So there is no side effect when this feature is enabled with no
HugeTLB pages in the system.  Someone may want to enable this feature in
the compiler time instead of using boot command line.  So add a config to
make it default on when someone do not want to enable it via command line.

Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Miaohe Lin e6be37b2e7 mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()
Since commit 99cb0dbd47 ("mm,thp: add read-only THP support for
(non-shmem) FS"), read-only THP file mapping is supported.  But it forgot
to add checking for it in transparent_hugepage_enabled().  To fix it, we
add checking for read-only THP file mapping and also introduce helper
transhuge_vma_enabled() to check whether thp is enabled for specified vma
to reduce duplicated code.  We rename transparent_hugepage_enabled to
transparent_hugepage_active to make the code easier to follow as suggested
by David Hildenbrand.

[linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
  Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:26 -07:00
Muchun Song 6be24bed9d mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some
vmemmap pages associated with pre-allocated HugeTLB pages.  For example,
on X86_64 6 vmemmap pages of size 4KB each can be saved for each 2MB
HugeTLB page.  4094 vmemmap pages of size 4KB each can be saved for each
1GB HugeTLB page.

When a HugeTLB page is allocated or freed, the vmemmap array representing
the range associated with the page will need to be remapped.  When a page
is allocated, vmemmap pages are freed after remapping.  When a page is
freed, previously discarded vmemmap pages must be allocated before
remapping.

The config option is introduced early so that supporting code can be
written to depend on the option.  The initial version of the code only
provides support for x86-64.

If config HAVE_BOOTMEM_INFO_NODE is enabled, the freeing vmemmap page code
denpend on it to free vmemmap pages.  Otherwise, just use
free_reserved_page() to free vmemmmap pages.  The routine
register_page_bootmem_info() is used to register bootmem info.  Therefore,
make sure register_page_bootmem_info is enabled if
HUGETLB_PAGE_FREE_VMEMMAP is defined.

Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-30 20:47:25 -07:00
Linus Torvalds a6ecc2a491 In addition to bug fixes and cleanups, there are two new features for
ext4 in 5.14:
  - Allow applications to poll on changes to /sys/fs/ext4/*/errors_count
  - Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
    checkpointed, truncated and discarded or zero'ed.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmDcjRgACgkQ8vlZVpUN
 gaMAMQgAjRYUQ+tdJVZzInFwukudhgLyuCP9AdCx76fisaH22yNCakQ7M2XGz59i
 /YbJerLaueYpHZzpA9p5+sSjVhMwILO3scBSJbOwdsbrFAsFLzcgQKQhGGqK2KvX
 IAOEArC8/hm1wnVb7sfQYdBHlWyeJpI8hd/8WZPlYtySlRnP1TZCd+X7y7lmNs1H
 QU1KECwstI2t8Lug0QeKx2B9PI9AWcCs0lTJ4LfcANZAh3HIJi9aUCk4SFDRkf3/
 8AazvMqTHJD9yc+BNyZOro2ykDFCStkNqf0cDYTzvKrr66CHScPUtyI0oAEdspxN
 +SNNARPGZgNOuR3ZRbGivtwgEB+GpQ==
 =jSd4
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "In addition to bug fixes and cleanups, there are two new features for
  ext4 in 5.14:

   - Allow applications to poll on changes to
     /sys/fs/ext4/*/errors_count

   - Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
     checkpointed, truncated and discarded or zero'ed"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
  jbd2: export jbd2_journal_[un]register_shrinker()
  ext4: notify sysfs on errors_count value change
  fs: remove bdev_try_to_free_page callback
  ext4: remove bdev_try_to_free_page() callback
  jbd2: simplify journal_clean_one_cp_list()
  jbd2,ext4: add a shrinker to release checkpointed buffers
  jbd2: remove redundant buffer io error checks
  jbd2: don't abort the journal when freeing buffers
  jbd2: ensure abort the journal if detect IO error when writing original buffer back
  jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
  ext4: no need to verify new add extent block
  jbd2: clean up misleading comments for jbd2_fc_release_bufs
  ext4: add check to prevent attempting to resize an fs with sparse_super2
  ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
  ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
  ext4: fsmap: fix the block/inode bitmap comment
  ext4: fix comment for s_hash_unsigned
  ext4: use local variable ei instead of EXT4_I() macro
  ext4: fix avefreec in find_group_orlov
  ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
  ...
2021-06-30 19:37:39 -07:00
Theodore Ts'o 8813587a99 Revert "ext4: consolidate checks for resize of bigalloc into ext4_resize_begin"
The function ext4_resize_begin() gets called from three different
places, and online resize for bigalloc file systems is disallowed from
the old-style online resize (EXT4_IOC_GROUP_ADD and
EXT4_IOC_GROUP_EXTEND), but it *is* supposed to be allowed via
EXT4_IOC_RESIZE_FS.

This reverts commit e9f9f61d0c.
2021-06-30 20:54:22 -04:00
Linus Torvalds dbe69e4337 Networking changes for 5.14.
Core:
 
  - BPF:
    - add syscall program type and libbpf support for generating
      instructions and bindings for in-kernel BPF loaders (BPF loaders
      for BPF), this is a stepping stone for signed BPF programs
    - infrastructure to migrate TCP child sockets from one listener
      to another in the same reuseport group/map to improve flexibility
      of service hand-off/restart
    - add broadcast support to XDP redirect
 
  - allow bypass of the lockless qdisc to improving performance
    (for pktgen: +23% with one thread, +44% with 2 threads)
 
  - add a simpler version of "DO_ONCE()" which does not require
    jump labels, intended for slow-path usage
 
  - virtio/vsock: introduce SOCK_SEQPACKET support
 
  - add getsocketopt to retrieve netns cookie
 
  - ip: treat lowest address of a IPv4 subnet as ordinary unicast address
        allowing reclaiming of precious IPv4 addresses
 
  - ipv6: use prandom_u32() for ID generation
 
  - ip: add support for more flexible field selection for hashing
        across multi-path routes (w/ offload to mlxsw)
 
  - icmp: add support for extended RFC 8335 PROBE (ping)
 
  - seg6: add support for SRv6 End.DT46 behavior
 
  - mptcp:
     - DSS checksum support (RFC 8684) to detect middlebox meddling
     - support Connection-time 'C' flag
     - time stamping support
 
  - sctp: packetization Layer Path MTU Discovery (RFC 8899)
 
  - xfrm: speed up state addition with seq set
 
  - WiFi:
     - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
     - aggregation handling improvements for some drivers
     - minstrel improvements for no-ack frames
     - deferred rate control for TXQs to improve reaction times
     - switch from round robin to virtual time-based airtime scheduler
 
  - add trace points:
     - tcp checksum errors
     - openvswitch - action execution, upcalls
     - socket errors via sk_error_report
 
 Device APIs:
 
  - devlink: add rate API for hierarchical control of max egress rate
             of virtual devices (VFs, SFs etc.)
 
  - don't require RCU read lock to be held around BPF hooks
    in NAPI context
 
  - page_pool: generic buffer recycling
 
 New hardware/drivers:
 
  - mobile:
     - iosm: PCIe Driver for Intel M.2 Modem
     - support for Qualcomm MSM8998 (ipa)
 
  - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
 
  - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
 
  - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
 
  - NXP SJA1110 Automotive Ethernet 10-port switch
 
  - Qualcomm QCA8327 switch support (qca8k)
 
  - Mikrotik 10/25G NIC (atl1c)
 
 Driver changes:
 
  - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
    (our first foray into MAC/PHY description via ACPI)
 
  - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
 
  - Mellanox/Nvidia NIC (mlx5)
    - NIC VF offload of L2 bridging
    - support IRQ distribution to Sub-functions
 
  - Marvell (prestera):
     - add flower and match all
     - devlink trap
     - link aggregation
 
  - Netronome (nfp): connection tracking offload
 
  - Intel 1GE (igc): add AF_XDP support
 
  - Marvell DPU (octeontx2): ingress ratelimit offload
 
  - Google vNIC (gve): new ring/descriptor format support
 
  - Qualcomm mobile (rmnet & ipa): inline checksum offload support
 
  - MediaTek WiFi (mt76)
     - mt7915 MSI support
     - mt7915 Tx status reporting
     - mt7915 thermal sensors support
     - mt7921 decapsulation offload
     - mt7921 enable runtime pm and deep sleep
 
  - Realtek WiFi (rtw88)
     - beacon filter support
     - Tx antenna path diversity support
     - firmware crash information via devcoredump
 
  - Qualcomm 60GHz WiFi (wcn36xx)
     - Wake-on-WLAN support with magic packets and GTK rekeying
 
  - Micrel PHY (ksz886x/ksz8081): add cable test support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDb+fUACgkQMUZtbf5S
 Irs2Jg//aqN0Q8CgIvYCVhPxQw1tY7pTAbgyqgBZ01vwjyvtIOgJiWzSfFEU84mX
 M8fcpFX5eTKrOyJ9S6UFfQ/JG114n3hjAxFFT4Hxk2gC1Tg0vHuFQTDHcUl28bUE
 mTm61e1YpdorILnv2k5JVQ/wu0vs5QKDrjcYcrcPnh+j93wvnPOgAfDBV95nZzjS
 OTt4q2fR8GzLcSYWWsclMbDNkzyTG50RW/0Yd6aGjr5QGvXfrMeXfUJNz533PMf/
 w5lNyjRKv+x9mdTZJzU0+msNUrZgUdRz7W8Ey8lD3hJZRE+D6/uU7FtsE8Mi3+uc
 HWxeZUyzA3YF1MfVl/eesbxyPT7S/OkLzk4O5B35FbqP0YltaP+bOjq1/nM3ce1/
 io9Dx9pIl/2JANUgRCAtLi8Z2dkvRoqTaBxZ/nPudCCljFwDwl6joTMJ7Ow22i5Y
 5aIkcXFmZq4LbJDiHvbTlqT7yiuaEvu2UK/23bSIg/K3nF4eAmkY9Y1EgiMf60OF
 78Ttw0wk2tUegwaS5MZnCniKBKDyl9gM2F6rbZ/IxQRR2LTXFc1B6gC+ynUxgXfh
 Ub8O++6qGYGYZ0XvQH4pzco79p3qQWBTK5beIp2eu6BOAjBVIXq4AibUfoQLACsu
 hX7jMPYd0kc3WFgUnKgQP8EnjFSwbf4XiaE7fIXvWBY8hzCw2h4=
 =LvtX
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - BPF:
      - add syscall program type and libbpf support for generating
        instructions and bindings for in-kernel BPF loaders (BPF loaders
        for BPF), this is a stepping stone for signed BPF programs
      - infrastructure to migrate TCP child sockets from one listener to
        another in the same reuseport group/map to improve flexibility
        of service hand-off/restart
      - add broadcast support to XDP redirect

   - allow bypass of the lockless qdisc to improving performance (for
     pktgen: +23% with one thread, +44% with 2 threads)

   - add a simpler version of "DO_ONCE()" which does not require jump
     labels, intended for slow-path usage

   - virtio/vsock: introduce SOCK_SEQPACKET support

   - add getsocketopt to retrieve netns cookie

   - ip: treat lowest address of a IPv4 subnet as ordinary unicast
     address allowing reclaiming of precious IPv4 addresses

   - ipv6: use prandom_u32() for ID generation

   - ip: add support for more flexible field selection for hashing
     across multi-path routes (w/ offload to mlxsw)

   - icmp: add support for extended RFC 8335 PROBE (ping)

   - seg6: add support for SRv6 End.DT46 behavior

   - mptcp:
      - DSS checksum support (RFC 8684) to detect middlebox meddling
      - support Connection-time 'C' flag
      - time stamping support

   - sctp: packetization Layer Path MTU Discovery (RFC 8899)

   - xfrm: speed up state addition with seq set

   - WiFi:
      - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
      - aggregation handling improvements for some drivers
      - minstrel improvements for no-ack frames
      - deferred rate control for TXQs to improve reaction times
      - switch from round robin to virtual time-based airtime scheduler

   - add trace points:
      - tcp checksum errors
      - openvswitch - action execution, upcalls
      - socket errors via sk_error_report

  Device APIs:

   - devlink: add rate API for hierarchical control of max egress rate
     of virtual devices (VFs, SFs etc.)

   - don't require RCU read lock to be held around BPF hooks in NAPI
     context

   - page_pool: generic buffer recycling

  New hardware/drivers:

   - mobile:
      - iosm: PCIe Driver for Intel M.2 Modem
      - support for Qualcomm MSM8998 (ipa)

   - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices

   - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches

   - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)

   - NXP SJA1110 Automotive Ethernet 10-port switch

   - Qualcomm QCA8327 switch support (qca8k)

   - Mikrotik 10/25G NIC (atl1c)

  Driver changes:

   - ACPI support for some MDIO, MAC and PHY devices from Marvell and
     NXP (our first foray into MAC/PHY description via ACPI)

   - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx

   - Mellanox/Nvidia NIC (mlx5)
      - NIC VF offload of L2 bridging
      - support IRQ distribution to Sub-functions

   - Marvell (prestera):
      - add flower and match all
      - devlink trap
      - link aggregation

   - Netronome (nfp): connection tracking offload

   - Intel 1GE (igc): add AF_XDP support

   - Marvell DPU (octeontx2): ingress ratelimit offload

   - Google vNIC (gve): new ring/descriptor format support

   - Qualcomm mobile (rmnet & ipa): inline checksum offload support

   - MediaTek WiFi (mt76)
      - mt7915 MSI support
      - mt7915 Tx status reporting
      - mt7915 thermal sensors support
      - mt7921 decapsulation offload
      - mt7921 enable runtime pm and deep sleep

   - Realtek WiFi (rtw88)
      - beacon filter support
      - Tx antenna path diversity support
      - firmware crash information via devcoredump

   - Qualcomm WiFi (wcn36xx)
      - Wake-on-WLAN support with magic packets and GTK rekeying

   - Micrel PHY (ksz886x/ksz8081): add cable test support"

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...
2021-06-30 15:51:09 -07:00
Christoph Hellwig da6269da4c block: remove REQ_OP_SCSI_{IN,OUT}
With the legacy IDE driver gone drivers now use either REQ_OP_DRV_*
or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
into a single one.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 15:34:19 -06:00
Hao Xu e149bd742b io_uring: code clean for kiocb_done()
A simple code clean for kiocb_done()

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Hao Xu 915b3dde9b io_uring: spin in iopoll() only when reqs are in a single queue
We currently spin in iopoll() when requests to be iopolled are for
same file(device), while one device may have multiple hardware queues.
given an example:

hw_queue_0     |    hw_queue_1
req(30us)           req(10us)

If we first spin on iopolling for the hw_queue_0. the avg latency would
be (30us + 30us) / 2 = 30us. While if we do round robin, the avg
latency would be (30us + 10us) / 2 = 20us since we reap the request in
hw_queue_1 in time. So it's better to do spinning only when requests
are in same hardware queue.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov 99ebe4efbd io_uring: pre-initialise some of req fields
Most of requests are allocated from an internal cache, so it's waste of
time fully initialising them every time. Instead, let's pre-init some of
the fields we can during initial allocation (e.g. kmalloc(), see
io_alloc_req()) and keep them valid on request recycling. There are four
of them in this patch:

->ctx is always stays the same
->link is NULL on free, it's an invariant
->result is not even needed to init, just a precaution
->async_data we now clean in io_dismantle_req() as it's likely to
   never be allocated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/892ba0e71309bba9fe9e0142472330bbf9d8f05d.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov 5182ed2e33 io_uring: refactor io_submit_flush_completions
Don't init req_batch before we actually need it. Also, add a small clean
up for req declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ad85512e12bd3a20d521e9782750300970e5afc8.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov 4cfb25bf88 io_uring: optimise hot path restricted checks
Move likely/unlikely from io_check_restriction() to specifically
ctx->restricted check, because doesn't do what it supposed to and make
the common path take an extra jump.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/22bf70d0a543dfc935d7276bdc73081784e30698.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov e5dc480d4e io_uring: remove not needed PF_EXITING check
Since cancellation got moved before exit_signals(), there is no one left
who can call io_run_task_work() with PF_EXIING set, so remove the check.
Note that __io_req_task_submit() still needs a similar check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7f305ececb1e6044ea649fb983ca754805bb884.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov dd432ea520 io_uring: mainstream sqpoll task_work running
task_works are widely used, so place io_run_task_work() directly into
the main path of io_sq_thread(), and remove it from other places where
it's not needed anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24eb5e35d519c590d3dffbd694b4c61a5fe49029.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov b2d9c3da77 io_uring: refactor io_arm_poll_handler()
gcc 11 goes a weird path and duplicates most of io_arm_poll_handler()
for READ and WRITE cases. Help it and move all pollin vs pollout
specific bits under a single if-else, so there is no temptation for this
kind of unfolding.

before vs after:
   text    data     bss     dec     hex filename
  85362   12650       8   98020   17ee4 ./fs/io_uring.o
  85186   12650       8   97844   17e34 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1deea0037293a922a0358e2958384b2e42437885.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Olivier Langlois 59b735aeeb io_uring: reduce latency by reissueing the operation
It is quite frequent that when an operation fails and returns EAGAIN,
the data becomes available between that failure and the call to
vfs_poll() done by io_arm_poll_handler().

Detecting the situation and reissuing the operation is much faster
than going ahead and push the operation to the io-wq.

Performance improvement testing has been performed with:
Single thread, 1 TCP connection receiving a 5 Mbps stream, no sqpoll.

4 measurements have been taken:
1. The time it takes to process a read request when data is already available
2. The time it takes to process by calling twice io_issue_sqe() after vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously

2.25% of all the read operations did use the new path.

ready data (baseline)
avg	3657.94182918628
min	580
max	20098
stddev	1213.15975908162

reissue	completion
average	7882.67567567568
min	2316
max	28811
stddev	1982.79172973284

insert io-wq time
average	8983.82276995305
min	3324
max	87816
stddev	2551.60056552038

async time completion
average	24670.4758861127
min	10758
max	102612
stddev	3483.92416873804

Conclusion:
On average reissuing the sqe with the patch code is 1.1uSec faster and
in the worse case scenario 59uSec faster than placing the request on
io-wq

On average completion time by reissuing the sqe with the patch code is
16.79uSec faster and in the worse case scenario 73.8uSec faster than
async completion.

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e8441419bb1b8f3c3fcc607b2713efecdef2136.1624364038.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Jens Axboe 22634bc562 io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.

Fixes: 14a1143b68 ("io_uring: add support for IORING_OP_UNLINKAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Jens Axboe ed7eb25922 io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.

Fixes: 80a261fd00 ("io_uring: add support for IORING_OP_RENAMEAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov 12dcb58ac7 io_uring: refactor io_openat2()
Put do_filp_open() fail path of io_openat2() under a single if,
deduplicating put_unused_fd(), making it look better and helping
the hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f4c84d25c049d0af2adc19c703bbfef607200209.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov 16340eab61 io_uring: update sqe layout build checks
Add missing BUILD_BUG_SQE_ELEM() for ->buf_group verifying that SQE
layout doesn't change.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9d21bd74599b856b3a632be4c23ffa184a3ef0.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov fe7e325750 io_uring: fix code style problems
Fix a bunch of problems mostly found by checkpatch.pl

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cfaf9a2f27b43934144fe9422a916bd327099f44.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov 1a924a8082 io_uring: refactor io_sq_thread()
Move needs_sched declaration into the block where it's used, so it's
harder to misuse/wrongfully reuse.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4a07db1353ee38b924dd1b45394cf8e746130b4.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov 948e19479c io_uring: don't change sqpoll creds if not needed
SQPOLL doesn't need to change creds if it's not submitting requests.
Move creds overriding into __io_sq_thread() after checking if there are
SQEs pending.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c54368da2357ac539e0a333f7cfff70d5fb045b2.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:38 -06:00
Linus Torvalds 440462198d for-5.14/drivers-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbd5UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvsNEADCJKP81boFzRcdJo7EqaNDAzZyKOIg9Oq7
 4GZE0Wm0SgA6+04bKrNVd9KLcKvQ+NC1pK7UJemSSH2y9ir+zHfyYgAV0/+wFmYm
 NgHlDjBvf80XSI5wezcb6MxZT+R7IaIpDsW1ZvV9hFtPSncn5o2OIWiSdJtHT/Rv
 enlgZPc7OwNWoVMX8eR58IoO0k3S6GLpctUZHt/AUukaKgoOks0X523qhEPf3Upr
 RkbIZuqLWVgpdT6457iSE/OijUczD4thTI8bdprxzhgimOm2vV52sO6F5HtHc7GX
 qW+PWYUaiUk7UpObuOuyv0yyUG45ii73iY1W0w66RiyCjVTgtpdwwMQ38VlBcoOg
 zcE1jneAEJt6TiS6zfRaER/10JoCIG4gp1+apPuaXud/o3BqWI0cagVHAgaLziBI
 F7bDJkbJZIR6GrWMgemBI+mc5/LACBePxzPGLScKFptejtQ/ysfZQ6aCLROJWB2U
 4EnysAaUBf6tywj30JqfQvqFNGkHIgY95FKiXJW6GzqqwgBouNf48vS15BgkwI+2
 EijcqUhlOVNfc3RIc0ZL5c9KcPIN9t5sqBrWZe3wgCErhxAx6w6Za9nDdP+US9bl
 /apCpvDFlu59g8n1wtkNE/uC+XqdKDwsplYhnfpX0FGni5wIknhQq3bSe4dPFgSn
 pG5VMrw3pA==
 =D6dS
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Pretty calm round, mostly just NVMe and a bit of MD:

   - NVMe updates (via Christoph)
        - improve the APST configuration algorithm (Alexey Bogoslavsky)
        - look for StorageD3Enable on companion ACPI device
          (Mario Limonciello)
        - allow selecting the network interface for TCP connections
          (Martin Belanger)
        - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
          Christoph)
        - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
          for certain AMD CPUs (Mario Limonciello)
        - zoned device support for nvmet (Chaitanya Kulkarni)
        - fix the rules for changing the serial number in nvmet
          (Noam Gottlieb)
        - various small fixes and cleanups (Dan Carpenter, JK Kim,
          Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
          Uytterhoeven, Daniel Wagner)

   - MD updates (Via Song)
        - iostats rewrite (Guoqing Jiang)
        - raid5 lock contention optimization (Gal Ofri)

   - Fall through warning fix (Gustavo)

   - Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
  nvmet: use NVMET_MAX_NAMESPACES to set nn value
  loop: Fix missing discard support when using LOOP_CONFIGURE
  nvme.h: add missing nvme_lba_range_type endianness annotations
  nvme: remove zeroout memset call for struct
  nvme-pci: remove zeroout memset call for struct
  nvmet: remove zeroout memset call for struct
  nvmet: add ZBD over ZNS backend support
  nvmet: add Command Set Identifier support
  nvmet: add nvmet_req_bio put helper for backends
  nvmet: add req cns error complete helper
  block: export blk_next_bio()
  nvmet: remove local variable
  nvmet: use nvme status value directly
  nvmet: use u32 type for the local variable nsid
  nvmet: use u32 for nvmet_subsys max_nsid
  nvmet: use req->cmd directly in file-ns fast path
  nvmet: use req->cmd directly in bdev-ns fast path
  nvmet: make ver stable once connection established
  nvmet: allow mn change if subsys not discovered
  nvmet: make sn stable once connection was established
  ...
2021-06-30 12:21:16 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Zhang Yi 16aa4c9a1f jbd2: export jbd2_journal_[un]register_shrinker()
Export jbd2_journal_[un]register_shrinker() to fix this error when
ext4 is built as a module:

  ERROR: modpost: "jbd2_journal_unregister_shrinker" undefined!
  ERROR: modpost: "jbd2_journal_register_shrinker" undefined!

Fixes: 4ba3fcdde7 ("jbd2,ext4: add a shrinker to release checkpointed buffers")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210630083638.140218-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-30 11:05:00 -04:00
Joe Perches 78c14b385c treewide: Add missing semicolons to __assign_str uses
The __assign_str macro has an unusual ending semicolon but the vast
majority of uses of the macro already have semicolon termination.

$ git grep -P '\b__assign_str\b' | wc -l
551
$ git grep -P '\b__assign_str\b.*;' | wc -l
480

Add semicolons to the __assign_str() uses without semicolon termination
and all the other uses without semicolon termination via additional defines
that are equivalent to __assign_str() with the eventual goal of removing
the semicolon from the __assign_str() macro definition.

Link: https://lore.kernel.org/lkml/1e068d21106bb6db05b735b4916bb420e6c9842a.camel@perches.com/
Link: https://lkml.kernel.org/r/48a056adabd8f70444475352f617914cef504a45.camel@perches.com

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-30 09:19:14 -04:00
Linus Torvalds 007b350a58 dlm for 5.14
This is a major dlm networking enhancement that adds message
 retransmission so that the dlm can reliably continue operating
 when network connections fail and nodes reconnect.  Previously,
 this would result in lost messages which could only be handled
 as a node failure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJg2zyvAAoJEDgbc8f8gGmqjKkP/0RaNf0PCPohzCoiRk8YbZLP
 ZOIBq/uZxmwahuG1s8Hqi7+HEIfgZMEK1deJzTPggVruJkj1PRYdqHIAUx6Fshwj
 +RG7Jcw02hprWrojDP8Ey1rFOeiSCi9wa2rznWJgSw31YJMDhUazQCA3VLNnxYXi
 KNLRNU897SMVMYadkzsEZhQ1oF9cokVWRynI/M5fYxvojO/afd6Hja+2RZxLBM8h
 Ono006So5K+BMo3trIU9wWJPu1VrgnM9U/teMkL/UnPJH3p/kqyzfb238VDR8nye
 eElOox4r4x99hoKds9zNErQBVWyaPsKnCMXAVY7lPAKmZ4bVvogOtAcE8+jlD3nx
 vssVYZKqkmX5SuJKTIATDUbeCh8in4dMCHQ48AJgyeX0JXyqEsFYzKw12D+cA71L
 mI1YK0SJSx8JkHfij5kXWhrCR9sku8WpJkk+3i3sFOZg6VEVLOoGLkMVAtEba8Ls
 Tiscj8uiDDS/mMQosswZtlipzD7ajsttN2Z0o/gkoMvs9QDdHM8UjNfPZn6zYJKP
 SVCAaYP5q9IQ1lmQAgrMFc378H09bigM4eo4IuwRAkG1iwifhif+xhw28K9ZMoAf
 dCDoEhMUidBh8KTG2vfsoUdvcK503xY+z/cSYXfIYcFDdQrF5TNyeiWObp1BRw3/
 LqbMp1elu1fSc74uI07P
 =se79
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This is a major dlm networking enhancement that adds message
  retransmission so that the dlm can reliably continue operating when
  network connections fail and nodes reconnect.

  Previously, this would result in lost messages which could only be
  handled as a node failure"

* tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (26 commits)
  fs: dlm: invalid buffer access in lookup error
  fs: dlm: fix race in mhandle deletion
  fs: dlm: rename socket and app buffer defines
  fs: dlm: introduce proto values
  fs: dlm: move dlm allow conn
  fs: dlm: use alloc_ordered_workqueue
  fs: dlm: fix memory leak when fenced
  fs: dlm: fix lowcomms_start error case
  fs: dlm: Fix spelling mistake "stucked" -> "stuck"
  fs: dlm: Fix memory leak of object mh
  fs: dlm: don't allow half transmitted messages
  fs: dlm: add midcomms debugfs functionality
  fs: dlm: add reliable connection if reconnect
  fs: dlm: add union in dlm header for lockspace id
  fs: dlm: move out some hash functionality
  fs: dlm: add functionality to re-transmit a message
  fs: dlm: make buffer handling per msg
  fs: dlm: add more midcomms hooks
  fs: dlm: public header in out utility
  fs: dlm: fix connection tcp EOF handling
  ...
2021-06-29 20:30:00 -07:00
Linus Torvalds 8418dabd97 Various minor gfs2 cleanups and fixes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmDa9GoUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr2Wg//VPDf8gD07ixZ9jOujwCB/0q5n403
 doTtonF+26VVN0e3aZY5M949cW/65ogKYiZgzC6JlCkqsGg+8Y8xQ2mxKtUPQoY7
 lGCU1uQVnoZMWpRzcigH3iGFz+yyW6PFH7lgr+mcyqBHNPc1zPK5oioV3KKQTtZH
 XUP5FGZ9A3noty/feMCcicjAfh3iqwBSa7+QqPe9OlZ08AeDbaHatFzSnOPerpEh
 eIOzv795q9MVKOghq+JHgELbnX58IrI3FVWDAWS9+KBfgdb23DV/78mmA9rjjP91
 UXD5IJ7hYDQiPkKHX6uHrLyQaIkZlXISEAYiEmiNzfVysopl88P1JAV6n6MfrCVx
 MX3ZEvOFAe5aSXjUMXPdhgYgf7dg1gV5L6zmDSTngQqegGxpp8tSXiUY1f5U/JU3
 hVSwN0H1xOd6w9GbZM8GJq316gM8DHZGS/0UyZjHNYrH1rt4m7C0K0aukbelJ0EM
 YdYceC6zONSk+ltM90Cvfy0WPO7nVm9hrG8nJcyxzyuIptrKqa9BCzLV7FvlObVE
 Ia3MQf0j7wF+33Ro3VcyQTV2xadqFIzeJ2yLxJZmJ8pf/Pyu8pB0/+8Ug3Xrve+w
 i1Q4y4LikJNNhyhBYt2Qev+myIx7VE8WoQ7tFwOmMHhKs9DbTP4WpyJL/x7Wy9wg
 34KFHHVp+jIeUwI=
 =47Ww
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:
 "Various minor gfs2 cleanups and fixes"

* tag 'gfs2-v5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Clean up gfs2_unstuff_dinode
  gfs2: Unstuff before locking page in gfs2_page_mkwrite
  gfs2: Clean up the error handling in gfs2_page_mkwrite
  gfs2: Fix error handling in init_statfs
  gfs2: Fix underflow in gfs2_page_mkwrite
  gfs2: Use list_move_tail instead of list_del/list_add_tail
  gfs2: Fix do_gfs2_set_flags description
2021-06-29 20:23:08 -07:00
Linus Torvalds bbd91626f7 38 cifs/smb3 fixes including improvement to fallocate emulation, some DFS fixes. About 1/3 are to address Coverity warnings
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmDaXUUACgkQiiy9cAdy
 T1HIgQv8CyUHp3NzQkRgfw0TJ0iozOE1TvcvZxlcnE1gXAc6F2hDavpoPrex2cp3
 OQEph0+JA7w3cKr84GNyjnVt00EWgvxhM0CBFVQt/x0dlEdUVJRfZ+jaUp4iyRqR
 R8Jml00vJwup+xHVHVUfhEpS6EhF5I39pb4zxGP9Efh5CfctsUucFJBTXD/xLJl5
 eTgdO1w/9Zte0Ga/f6nB0k+c4IzCkJUsgt9/H4ivA3xOv3r8+i/fyOHTyyOkRsGq
 GStAo0pfV5h9s4laYK/brGAAsCaCn/2jsKriB7dwfOUql6M6VKIVFBbQ3TNLTw7f
 lYxxlOjABn7gY9LvkOpRYtu0yd9iUHTwZLreWLdaSsvSftfX6+XZfm7b5lfCn8xn
 5UCxw1obhczxZM2qWdV5Tl+EzJpNrVkdK5Q3wvpXBmbl92embHb85DQHYEzgxsy3
 OBwlSUYDgLdgc8eXuP0WlfF15NA91RC3msgfQ3PKRqM3xnNSqbp6RbPXW1Bd1M1V
 UqZBZhJc
 =EIWo
 -----END PGP SIGNATURE-----

Merge tag '5.14-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:

 - improve fallocate emulation

 - DFS fixes

 - minor multichannel fixes

 - various cleanup patches, many to address Coverity warnings

* tag '5.14-rc-smb3-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (38 commits)
  smb3: prevent races updating CurrentMid
  cifs: fix missing spinlock around update to ses->status
  cifs: missing null pointer check in cifs_mount
  smb3: fix possible access to uninitialized pointer to DACL
  cifs: missing null check for newinode pointer
  cifs: remove two cases where rc is set unnecessarily in sid_to_id
  SMB3: Add new info level for query directory
  cifs: fix NULL dereference in smb2_check_message()
  smbdirect: missing rc checks while waiting for rdma events
  cifs: Avoid field over-reading memcpy()
  smb311: remove dead code for non compounded posix query info
  cifs: fix SMB1 error path in cifs_get_file_info_unix
  smb3: fix uninitialized value for port in witness protocol move
  cifs: fix unneeded null check
  cifs: use SPDX-Licence-Identifier
  cifs: convert list_for_each to entry variant in cifs_debug.c
  cifs: convert list_for_each to entry variant in smb2misc.c
  cifs: avoid extra calls in posix_info_parse
  cifs: retry lookup and readdir when EAGAIN is returned.
  cifs: fix check of dfs interlinks
  ...
2021-06-29 20:18:16 -07:00
Linus Torvalds b97902b62a fs.openat2.unknown_flags.v5.14
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYNnKewAKCRCRxhvAZXjc
 oo/DAQCgKsDJTSht/QXuA0bMqdsQW27AWFfKacbk5lY4EjXz1gD/ZsYU2Si1fgkB
 7mEl32JsfgcIBv0VdIulAh2F29Fa0A0=
 =/b8l
 -----END PGP SIGNATURE-----

Merge tag 'fs.openat2.unknown_flags.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull openat2 fixes from Christian Brauner:

 - Remove the unused VALID_UPGRADE_FLAGS define we carried from an
   extension to openat2() that we haven't merged. Aleksa might be
   getting back to it at some point but just not right now.

 - openat2() used to accidently ignore unknown flag values in the upper
   32 bits.

   The new openat2() syscall verifies that no unknown O-flag values are
   set and returns an error to userspace if they are while the older
   open syscalls like open() and openat() simply ignore unknown flag
   values:

      #define O_FLAG_CURRENTLY_INVALID (1 << 31)
      struct open_how how = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
            .resolve = 0,
      };

      /* fails */
      fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

      /* succeeds */
      fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

   However, openat2() silently truncates the upper 32 bits meaning:

      #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
      #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)

      struct open_how how_lowe32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
      };

      struct open_how how_upper32 = {
            .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
      };

      /* fails */
      fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

      /* succeeds */
      fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

   Fix this by preventing the immediate truncation in build_open_flags()
   and add a compile-time check to catch when we add flags in the upper
   32 bit range.

* tag 'fs.openat2.unknown_flags.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  test: add openat2() test for invalid upper 32 bit flag value
  open: don't silently ignore unknown O-flags in openat2()
  fcntl: remove unused VALID_UPGRADE_FLAGS
2021-06-29 20:10:16 -07:00
Linus Torvalds 30d1a556a9 fs.mount_setattr.nosymfollow.v5.14
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYNnKNAAKCRCRxhvAZXjc
 ok/HAQDz3FK3/yeqxH6OLyCedUcD+YBFPPzrfqX+3y6q3z5tGgD9GAGxFXWcMFA2
 /cbfmizwh1eJ3WMnbHUp7x6ogpQhWwQ=
 =PpNs
 -----END PGP SIGNATURE-----

Merge tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr updates from Christian Brauner:
 "A few releases ago the old mount API gained support for a mount
  options which prevents following symlinks on a given mount. This adds
  support for it in the new mount api through the MOUNT_ATTR_NOSYMFOLLOW
  flag via mount_setattr() and fsmount(). With mount_setattr() that flag
  can even be applied recursively.

  There's an additional ack from Ross Zwisler who originally authored
  the nosymfollow patch. As I've already had the patches in my for-next
  I didn't add his ack explicitly"

* tag 'fs.mount_setattr.nosymfollow.v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: test MOUNT_ATTR_NOSYMFOLLOW with mount_setattr()
  mount: Support "nosymfollow" in new mount api
2021-06-29 20:07:06 -07:00
Jonathan Davies d578b99443 ext4: notify sysfs on errors_count value change
After s_error_count is incremented, signal the change in the
corresponding sysfs attribute via sysfs_notify. This allows userspace to
poll() on changes to /sys/fs/ext4/*/errors_count.

[ Moved call of ext4_notify_error_sysfs() to flush_stashed_error_work()
  to avoid BUG's caused by calling sysfs_notify trying to sleep after
  being called from an invalid context. -- TYT ]

Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
Link: https://lore.kernel.org/r/20210611140209.28903-1-jonathan.davies@nutanix.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-29 21:06:02 -04:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Linus Torvalds e563592c3e printk changes for 5.14
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmDa2mEACgkQUqAMR0iA
 lPKBiBAAgvhNnaRVR6/GBVrv5jYM8obJM7PHPxp8dh+ZRb1mDyL1ZDU2r7lmQjMD
 ORBN5eK6pXk/gVabXR5lY0B7vQ8phJmYO98Lk2E3n9ZTgMkTHQ5WOHzBpt93gd/y
 l9m00ZD0YcHrkmM1fq73FuZVEMzPk85cjTe8n6JeHJgSAdoOY/rl61Cn57ZHFIa4
 DkpdNGkJaf77UIWOc8NLAXOdSD9NxSGycHXpU0q8QO9UFq+Le2qN4OPj3S1CNCO2
 ciy+VcW1VQ/BesPPlBIk3ImHWPS4ty3n4EYFzNm+saElIaWxyhNBXAvcBAK/x9LK
 3QibfBFwbS3sllhnk96Z24UaWWMg2AygbV2aqd3xMLpW3XD6q/MVnWGHfayhnmYj
 aNcWpldIjwDH4iZJ5vnD4KewQpYp+Jc5Hqv6UyIf1O8nEvvQubrDXjSDLLcbZFI1
 m2cs9DTc5ezyX/DifBsViDbw8hPjJg7QAbRjVk1EfVQrH090mRQoSoQQI4QtuMEi
 pPiTALNG1HRKIoYrKxQMB43JvZ1zjaSbtNbW4JJ9Bm3kxFZ/Oa8NXzE5BOjeykZm
 bCePtc018GZaGNW0z/Zr460c/Tuaj8fZSzUOj9Xnw5Hv4G3W5+5EqDy7sU/GTPjL
 It9rAZYo+cM9vp1BD2343YPZgnChWHaW0BF/WDqFAhLd9av/WKI=
 =Oa1c
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Add %pt[RT]s modifier to vsprintf(). It overrides ISO 8601 separator
   by using ' ' (space). It produces "YYYY-mm-dd HH:MM:SS" instead of
   "YYYY-mm-ddTHH:MM:SS".

 - Correctly parse long row of numbers by sscanf() when using the field
   width. Add extensive sscanf() selftest.

 - Generalize re-entrant CPU lock that has already been used to
   serialize dump_stack() output. It is part of the ongoing printk
   rework. It will allow to remove the obsoleted printk_safe buffers and
   introduce atomic consoles.

 - Some code clean up and sparse warning fixes.

* tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: fix cpu lock ordering
  lib/dump_stack: move cpu lock to printk.c
  printk: Remove trailing semicolon in macros
  random32: Fix implicit truncation warning in prandom_seed_state()
  lib: test_scanf: Remove pointless use of type_min() with unsigned types
  selftests: lib: Add wrapper script for test_scanf
  lib: test_scanf: Add tests for sscanf number conversion
  lib: vsprintf: Fix handling of number field widths in vsscanf
  lib: vsprintf: scanf: Negative number must have field width > 1
  usb: host: xhci-tegra: Switch to use %ptTs
  nilfs2: Switch to use %ptTs
  kdb: Switch to use %ptTs
  lib/vsprintf: Allow to override ISO 8601 date and time separator
2021-06-29 12:07:18 -07:00
David Hildenbrand a4eec6a3df binfmt: remove in-tree usage of MAP_EXECUTABLE
Ever since commit e9714acf8c ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
essentially completely ignored.  Let's remove all usage of MAP_EXECUTABLE.

[akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]

Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
Andrea Arcangeli a458b76a41 mm: gup: pack has_pinned in MMF_HAS_PINNED
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.

set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
  Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle) 3a6b216200 mm: move page dirtying prototypes from mm.h
These functions implement the address_space ->set_page_dirty operation and
should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
get funny ideas about calling them directly.

Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle) b82a96c925 fs: remove noop_set_page_dirty()
Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future.  It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.

[akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]

Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle) fc50eee329 fs: remove anon_set_page_dirty()
Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future.  It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.

Link: https://lkml.kernel.org/r/20210615162342.1669332-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle) fd7353f88b iomap: use __set_page_dirty_nobuffers
The only difference between iomap_set_page_dirty() and
__set_page_dirty_nobuffers() is that the latter includes a debugging check
that a !Uptodate page has private data.

Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle) 6e1cae881a mm/writeback: move __set_page_dirty() to core mm
Patch series "Further set_page_dirty cleanups".

Prompted by Christoph's recent patches, here are some more patches to
improve the state of set_page_dirty().  They're all from the folio tree,
so they've been tested to a certain extent.

This patch (of 6):

Nothing in __set_page_dirty() is specific to buffer_head, so move it to
mm/page-writeback.c.  That removes the only caller of
account_page_dirtied() outside of page-writeback.c, so make it static.

Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Christoph Hellwig 0af573780b mm: require ->set_page_dirty to be explicitly wired up
Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
that method up for the missing instances.

[hch@lst.de: ecryptfs: add a ->set_page_dirty cludge]
  Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de

Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Christoph Hellwig c1e3dbe981 fs: move ramfs_aops to libfs
Move the ramfs aops to libfs and reuse them for kernfs and configfs.
Thosw two did not wire up ->set_page_dirty before and now get
__set_page_dirty_no_writeback, which is the right one for no-writeback
address_space usage.

Drop the now unused exports of the libfs helpers only used for ramfs-style
pagecache usage.

Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Christoph Hellwig 34ebcce793 fs: unexport __set_page_dirty
Patch series "remove the implicit .set_page_dirty default".

This series cleans up a few lose ends around ->set_page_dirty, most
importantly removes the default to the buffer head based on if no method
is wired up.

This patch (of 3):

__set_page_dirty is only used by built-in code.

Link: https://lkml.kernel.org/r/20210614061512.3966143-1-hch@lst.de
Link: https://lkml.kernel.org/r/20210614061512.3966143-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin c22d70a162 writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb.  It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data).  This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.

Reuse the existing mechanism of inode switching used for foreign inode
detection.  To speed things up batch up to 115 inode switching in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes).  Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching.  Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight).  This
allows to keep umounting working (flush the switching queue), however
prevents cleanups from consuming the whole switching quota and effectively
blocking the frn switching.

A cgwb cleanup operation can fail due to different reasons (e.g.  not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc).  In this case the next scheduled cleanup will make a
new attempt.  An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user).  In the future
an additional attempt scheduled by a timer can be implemented.

[guro@fb.com: replace open-coded "115" with arithmetic]
  Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
[guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
  Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
[willy@infradead.org: fix documentation]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f5fbe6b7ad writeback, cgroup: support switching multiple inodes at once
Currently only a single inode can be switched to another writeback
structure at once.  That means to switch an inode a separate
inode_switch_wbs_context structure must be allocated, and a separate rcu
callback and work must be scheduled.

It's fine for the existing ad-hoc switching, which is not happening that
often, but sub-optimal for massive switching required in order to release
a writeback structure.  To prepare for it, let's add a support for
switching multiple inodes at once.

Instead of containing a single inode pointer, inode_switch_wbs_context
will contain a NULL-terminated array of inode pointers.
inode_do_switch_wbs() will be called for each inode.

To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's
list_locks will be acquired and released only once altogether for all
inodes.  wb_wakeup() will be also be called only once.  Instead of calling
wb_put(old_wb) after each successful switch, wb_put_many() is introduced
and used.

Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 72d4512e9c writeback, cgroup: split out the functional part of inode_switch_wbs_work_fn()
Split out the functional part of the inode_switch_wbs_work_fn() function
as inode_do switch_wbs() to reuse it later for switching inodes attached
to dying cgwbs.

This commit doesn't bring any functional changes.

Link: https://lkml.kernel.org/r/20210608230225.2078447-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f3b6a6df38 writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific
cgwb structure.  It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.

While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
cleaned up, it's removed from all io lists.  So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.

This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb.  Inodes
attached to the root bdi_writeback structures are never placed on such
list.  The following patch will use this list to try to release cgwbs
structures more efficiently.

Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 29264d92a0 writeback, cgroup: switch to rcu_work API in inode_switch_wbs()
Inode's wb switching requires two steps divided by an RCU grace period.
It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(),
which schedules inode_switch_wbs_work_fn() as a work.

Switching to the rcu_work API allows to do the same in a cleaner and
slightly shorter form.

Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 8826ee4fe7 writeback, cgroup: increment isw_nr_in_flight before grabbing an inode
isw_nr_in_flight is used to determine whether the inode switch queue
should be flushed from the umount path.  Currently it's increased after
grabbing an inode and even scheduling the switch work.  It means the
umount path can walk past cleanup_offline_cgwb() with active inode
references, which can result in a "Busy inodes after unmount." message and
use-after-free issues (with inode->i_sb which gets freed).

Fix it by incrementing isw_nr_in_flight before doing anything with the
inode and decrementing in the case when switching wasn't scheduled.

The problem hasn't yet been seen in the real life and was discovered by
Jan Kara by looking into the code.

Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00