Pull vfs name lookup updates from Al Viro:
"Small namei.c patch series, mostly to simplify the rules for nameidata
state. It's actually from the previous cycle - but I didn't post it
for review in time...
Changes visible outside of fs/namei.c: file_open_root() calling
conventions change, some freed bits in LOOKUP_... space"
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namei: make sure nd->depth is always valid
teach set_nameidata() to handle setting the root as well
take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
switch file_open_root() to struct path
Pull iov_iter updates from Al Viro:
"iov_iter cleanups and fixes.
There are followups, but this is what had sat in -next this cycle. IMO
the macro forest in there became much thinner and easier to follow..."
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
clean up copy_mc_pipe_to_iter()
pipe_zero(): we don't need no stinkin' kmap_atomic()...
iov_iter: clean csum_and_copy_...() primitives up a bit
copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
iterate_xarray(): only of the first iteration we might get offset != 0
pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
iov_iter: make iterator callbacks use base and len instead of iovec
iov_iter: make the amount already copied available to iterator callbacks
iov_iter: get rid of separate bvec and xarray callbacks
iov_iter: teach iterate_{bvec,xarray}() about possible short copies
iterate_bvec(): expand bvec.h macro forest, massage a bit
iov_iter: unify iterate_iovec and iterate_kvec
iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
iterate_and_advance(): get rid of magic in case when n is 0
csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
[xarray] iov_iter_npages(): just use DIV_ROUND_UP()
iov_iter_npages(): don't bother with iterate_all_kinds()
...
Pull vfs d_path() updates from Al Viro:
"d_path.c refactoring"
* 'work.d_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
getcwd(2): clean up error handling
d_path: prepend_path() is unlikely to return non-zero
d_path: prepend_path(): lift the inner loop into a new helper
d_path: prepend_path(): lift resetting b in case when we'd return 3 out of loop
d_path: prepend_path(): get rid of vfsmnt
d_path: introduce struct prepend_buffer
d_path: make prepend_name() boolean
d_path: lift -ENAMETOOLONG handling into callers of prepend_path()
d_path: don't bother with return value of prepend()
getcwd(2): saner logics around prepend_path() call
d_path: get rid of path_with_deleted()
d_path: regularize handling of root dentry in __dentry_path()
d_path: saner calling conventions for __dentry_path()
d_path: "\0" is {0,0}, not {0}
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts, softirqs
and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail what
sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event.
This has been hardcoded as "1" since 2015, no tooling should be looking
at it now. If one exists, we can revert this commit, fix that tool and
try to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes trace
events to write to console. When user space starts, this can easily live
lock the system. Having a boot option to stop just after boot up is
useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that match
the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path from
user space that can do this. All other paths are considered a bug.
- Small clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/
yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ=
=WEPR
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.
- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
- Refactor the buffer cache to use bulk page allocation
- Convert agnumber-based AG iteration to walk per-AG structures
- Clean up some unit conversions and other code warts
- Reduce spinlock contention in the directio fastpath
- Collapse all the inode cache walks into a single function
- Remove indirect function calls from the inode cache walk code
- Dramatically reduce the number of cache flushes sent when writing log
buffers
- Preserve inode sickness reports for longer
- Rename xfs_eofblocks since it controls inode cache walks
- Refactor the extended attribute code to prepare it for the addition
of log intent items to make xattrs fully transactional
- A few fixes to earlier large patchsets
- Log recovery fixes so that we don't accidentally mark the log clean
when log intent recovery fails
- Fix some latent SOB errors
- Clean up shutdown messages that get logged to dmesg
- Fix a regression in the online shrink code
- Fix a UAF in the buffer logging code if the fs goes offline
- Fix uninitialized error variables
- Fix a UAF in the CIL when commited log item callbacks race with a
shutdown
- Fix a bug where the CIL could hang trying to push part of the log ring
buffer that hasn't been filled yet
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDXP38ACgkQ+H93GTRK
tOsKzw//eHvEgeyBo7ek06GDsUph2kQVR9AJWE7MNMiBFxlmL8R9H225xJK7Qmcr
YswcyEeDq8cNXbXDA249ueuMb+DxhZPY68hPK5BJ3KsbvL2RZV0lJCbk492l4cgb
IvBJiG/MDo55km83tdr81AlmFYQM7rSQz5MbVogGxxsnp0ul3VpIrJZba8kPRDQ1
mZzH2fdlnE9Ozw/CfvjSgT1pySyFpxNeTRucYXUQil1hL1AGTBw7rGGNnccS090y
u/EawQ4WJ131m8O3+WomUmaGyZFlWvTpHzukKxvrEvZ6AG+HpIhMcbZ5J6nkRTY4
xxhUBG2qNKIcgPmPwAGmx1cylcsOCNKQgp+fko9tAZjEkgT5cbCpqpjGgjNB0RCf
pB0PY6idCFl9hmBpVgMWz2AZ9IsDmK54qufmLtzq/zN8cThzt6A95UUR0rGu5Kd8
CUmmdQTYl0GqlTTszCO2rw1+zRtcasMpBVmeYHDxy00bd1dHLUJ6o8DuXRYTTQti
J/6CZVVD56jieRb+uvrOq4mhiPR2kynciiu1dXdY5kx79kKom6HMBBvtTl8b9kmh
smWihfip7BTpz5vFzcwFmMxFwzW3K4LnDZl7qEGqXDEIHOL+pRWazU2yN3JZRGyd
z4SQMJuER0HTTA0yO09c3/CX9onorhjUIMgQ9U25l1hdyFna0+o=
=08Q9
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Most of the work this cycle has been on refactoring various parts of
the codebase. The biggest non-cleanup changes are (1) reducing the
number of cache flushes sent when writing the log; (2) a substantial
number of log recovery fixes; and (3) I started accepting pull
requests from contributors if the commits in their branches match
what's been sent to the list.
For a week or so I /had/ staged a major cleanup of the logging code
from Dave Chinner, but it exposed so many lurking bugs in other parts
of the logging and log recovery code that I decided to defer that
patchset until we can address those latent bugs.
Larger cleanups this time include walking the incore inode cache (me)
and rework of the extended attribute code (Allison) to prepare it for
adding logged xattr updates (and directory tree parent pointers) in
future releases.
Summary:
- Refactor the buffer cache to use bulk page allocation
- Convert agnumber-based AG iteration to walk per-AG structures
- Clean up some unit conversions and other code warts
- Reduce spinlock contention in the directio fastpath
- Collapse all the inode cache walks into a single function
- Remove indirect function calls from the inode cache walk code
- Dramatically reduce the number of cache flushes sent when writing
log buffers
- Preserve inode sickness reports for longer
- Rename xfs_eofblocks since it controls inode cache walks
- Refactor the extended attribute code to prepare it for the addition
of log intent items to make xattrs fully transactional
- A few fixes to earlier large patchsets
- Log recovery fixes so that we don't accidentally mark the log clean
when log intent recovery fails
- Fix some latent SOB errors
- Clean up shutdown messages that get logged to dmesg
- Fix a regression in the online shrink code
- Fix a UAF in the buffer logging code if the fs goes offline
- Fix uninitialized error variables
- Fix a UAF in the CIL when commited log item callbacks race with a
shutdown
- Fix a bug where the CIL could hang trying to push part of the log
ring buffer that hasn't been filled yet"
* tag 'xfs-5.14-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (102 commits)
xfs: don't wait on future iclogs when pushing the CIL
xfs: Fix a CIL UAF by getting get rid of the iclog callback lock
xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks
xfs: don't nest icloglock inside ic_callback_lock
xfs: Initialize error in xfs_attr_remove_iter
xfs: fix endianness issue in xfs_ag_shrink_space
xfs: remove dead stale buf unpin handling code
xfs: hold buffer across unpin and potential shutdown processing
xfs: force the log offline when log intent item recovery fails
xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes
xfs: shorten the shutdown messages to a single line
xfs: print name of function causing fs shutdown instead of hex pointer
xfs: fix type mismatches in the inode reclaim functions
xfs: separate primary inode selection criteria in xfs_iget_cache_hit
xfs: refactor the inode recycling code
xfs: add iclog state trace events
xfs: xfs_log_force_lsn isn't passed a LSN
xfs: Fix CIL throttle hang when CIL space used going backwards
xfs: journal IO cache flush reductions
xfs: remove need_start_rec parameter from xlog_write()
...
- fix a memleak in configfs_release_bin_file (Chung-Chiang Cheng)
- implement the .read_iter and .write_iter (Bart Van Assche)
- minor cleanups (Bart Van Assche, me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmDfOe8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNbuA//dTCwt9U8+XIDeWatRJY2BFSTorNhaeayYiGDDnVo
7NVN/vw5qtt7UHo7hFBi918gh9IGSIbgyYZIZAV4/qCqnveNMnD/Zuwd8j0FHzu4
OvmjORj52cpzXiyJ87vJ+fGVjqe9/Vtrzr4IFyhJZQnz+c1L0NOgReAiLJtE+1C/
8RvecwCvIxZVjw8SRjOZNCd/DFBx6nPbwsh0ZLkAiL6TQkV5erOUmFyoOAiXRQwB
H+6fcrjGteoXXciGmvRtfIaJEO5o2yE4pXEsoYv2aDBF5dtpq/wti8CV7bWSIXiV
HI90UrVDO+bK94DKC0JmYFEAxi7HmX073OWu0xBgXjpUnc1kepJ+9DJ74Y43QwwF
C0eHKu4b/nP/50c/7UxqUSmS23c9YiJawkYZjxbnxklN9QhRs8v61htOO6i9CFnu
KjePcLs07sn6AEa16taCV/+iwfgyul047jegrZZSTNioBPgpdMX+j2fzt2FTWioJ
zfVLsiBqvKDKXEkulsTMMFmO4Un2kcX7cTX7v9DclGvpCuEoOpPUDERS5H8+DkqV
q+O2HPohFdFItcWge58eCvZ0OJ43Px07Mpn6WH48VLCDLTg7H7Z/85yEO6Wg+a6V
JlgwdoAlHo5BHfS2w7PtkblbaJH9MQRc6dvSuv/sE1x8AouWze7wkfVb0ZH0X1GF
Xjo=
=dquw
-----END PGP SIGNATURE-----
Merge tag 'configfs-5.13' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:
- fix a memleak in configfs_release_bin_file (Chung-Chiang Cheng)
- implement the .read_iter and .write_iter (Bart Van Assche)
- minor cleanups (Bart Van Assche, me)
* tag 'configfs-5.13' of git://git.infradead.org/users/hch/configfs:
configfs: simplify configfs_release_bin_file
configfs: fix memleak in configfs_release_bin_file
configfs: implement the .read_iter and .write_iter methods
configfs: drop pointless kerneldoc comments
configfs: fix the kerneldoc comment for configfs_create_bin_file
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k
QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2
Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ
P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru
62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq
SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu
+u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw==
=tn0h
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs updates from Jan Kara:
"The new quotactl_fd() syscall (remake of quotactl_path() syscall that
got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
isofs, and writeback fixes and cleanups"
* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: fix obtain a reference to a freeing memcg css
quota: remove unnecessary oom message
isofs: remove redundant continue statement
quota: Wire up quotactl_fd syscall
quota: Change quotactl_path() systcall to an fd-based one
reiserfs: Remove unneed check in reiserfs_write_full_page()
udf: Fix NULL pointer dereference in udf_symlink function
reiserfs: add check for invalid 1st journal block
Delete NULL check, all callers pass valid pointer.
Delete ->load_binary check -- failure to provide hook in a custom module
will be very noticeable at the very first execve call.
Link: https://lkml.kernel.org/r/YK1Gy1qXaLAR+tPl@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The create_date field of inode in hfsplus is corresponding to
kstat.btime and could be reported in statx.
Link: https://lkml.kernel.org/r/20210416172147.8736-1-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message
Remove it can help us save a bit of memory.
Link: https://lkml.kernel.org/r/20210617084944.1279-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no more users of the seq_escape_mem_ascii() followed by
string_escape_mem_ascii().
Remove them for good.
Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The seq_escape_mem_ascii() is completely non-flexible and shouldn't be
used. Replace it with properly called seq_escape_mem().
Link: https://lkml.kernel.org/r/20210504180819.73127-15-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert seq_escape() to use seq_escape_str() rather than open coding it.
Note, for now we leave it as an exported symbol due to some old code that
can't tolerate ctype.h being (indirectly) included.
Link: https://lkml.kernel.org/r/20210504180819.73127-14-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce seq_escape_mem() to allow users to pass additional parameters to
string_escape_mem().
Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.
Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.
Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.
Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Souza Cascardo <cascardo@canonical.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Add support for SVM atomics in Nouveau", v11.
Introduction
============
Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.
These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .
Implementation
==============
Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.
Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.
Patches
=======
Patches 1 & 2 refactor existing migration and device private entry
functions.
Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().
Patch 5 renames some existing code but does not introduce functionality.
Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
Patch 7 contains the bulk of the implementation for device exclusive
memory.
Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.
Patch 9 is a cleanup for the Nouveau SVM implementation.
Patch 10 contains the implementation of atomic access for the Nouveau
driver.
Testing
=======
This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/
Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.
This patch (of 10):
Remove multiple similar inline functions for dealing with different types
of special swap entries.
Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.
Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE. This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY). Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open(). Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().
Systems that make heavy use of shared libraries (e.g. Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.
This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()). It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.
Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.
The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout. The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping
to be backed with THPs.
Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.
4KB Pages:
==========
count event_name # count / runtime
598,995,035,942 cpu-cycles # 1.800861 GHz
81,195,620,851 raw-stall-frontend # 244.112 M/sec
347,754,466,597 iTLB-loads # 1.046 G/sec
2,970,248,900 iTLB-load-misses # 0.854122% miss rate
Total test time: 332.854998 seconds.
2MB Pages:
==========
count event_name # count / runtime
592,872,663,047 cpu-cycles # 1.800358 GHz
76,485,624,143 raw-stall-frontend # 232.261 M/sec
350,478,413,710 iTLB-loads # 1.064 G/sec
803,233,322 iTLB-load-misses # 0.229182% miss rate
Total test time: 329.826087 seconds
A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
/apex/com.android.art/lib64/libart.so
FilePmdMapped: 4096 kB
/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped: 2048 kB
Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's properly synchronize with drivers that set PageOffline().
Unfreeze/thaw every now and then, so drivers that want to set
PageOffline() can make progress.
Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's avoid reading:
1) Offline memory sections: the content of offline memory sections is
stale as the memory is effectively unused by the kernel. On s390x with
standby memory, offline memory sections (belonging to offline storage
increments) are not accessible. With virtio-mem and the hyper-v
balloon, we can have unavailable memory chunks that should not be
accessed inside offline memory sections. Last but not least, offline
memory sections might contain hwpoisoned pages which we can no longer
identify because the memmap is stale.
2) PG_offline pages: logically offline pages that are documented as
"The content of these pages is effectively stale. Such pages should
not be touched (read/write/dump/save) except by their owner.".
Examples include pages inflated in a balloon or unavailble memory
ranges inside hotplugged memory sections with virtio-mem or the hyper-v
balloon.
3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
As documented: "Accessing is not safe since it may cause another
machine check. Don't touch!"
Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but best we can really do.
Reading /proc/kcore now performs similar checks as when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are exclude.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page()
yet.
Note 1: we can race against memory offlining code, especially memory going
offline and getting unplugged: however, we will properly tear down the
identity mapping and handle faults gracefully when accessing this memory
from kcore code.
Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor. We'll handle this in a follow-up
patch.
Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's resturcture the code, using switch-case, and checking pfn_is_ram()
only when we are dealing with KCORE_RAM.
Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.
Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.
Examples for PageOffline() pages are pages inflated in a balloon, memory
unplugged via virtio-mem, and partially-present sections in memory added
by the Hyper-V balloon.
When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.
With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch.
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.
Tested via kcore_dump:
https://github.com/schlafwandler/kcore_dump
This patch (of 6):
Commit db779ef67f ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.
Commit 595dd46ebf ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.
Let's drop both types. While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.
Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.
Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows shmem-backed VMAs to be registered for minor faults.
Minor faults are appropriately relayed to userspace in the fault path, for
VMAs with the relevant flag.
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
minor faults, though, so userspace doesn't yet have a way to resolve such
faults.
Because of this, we also don't yet advertise this as a supported feature.
That will be done in a separate commit when the feature is fully
implemented.
Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export the PTE/PMD status of uffd-wp to pagemap too.
Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should fail uffd-wp registration immediately if the arch does not even
have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined. That'll block also relevant
ioctls on e.g. UFFDIO_WRITEPROTECT because that'll check against
VM_UFFD_WP, which can only be applied with a success registration.
Remove the WP feature bit too for those archs when handling UFFDIO_API
ioctl.
Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap pages
associated with each HugeTLB page is default off. Now the vmemmap is PMD
mapped. So there is no side effect when this feature is enabled with no
HugeTLB pages in the system. Someone may want to enable this feature in
the compiler time instead of using boot command line. So add a config to
make it default on when someone do not want to enable it via command line.
Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 99cb0dbd47 ("mm,thp: add read-only THP support for
(non-shmem) FS"), read-only THP file mapping is supported. But it forgot
to add checking for it in transparent_hugepage_enabled(). To fix it, we
add checking for read-only THP file mapping and also introduce helper
transhuge_vma_enabled() to check whether thp is enabled for specified vma
to reduce duplicated code. We rename transparent_hugepage_enabled to
transparent_hugepage_active to make the code easier to follow as suggested
by David Hildenbrand.
[linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some
vmemmap pages associated with pre-allocated HugeTLB pages. For example,
on X86_64 6 vmemmap pages of size 4KB each can be saved for each 2MB
HugeTLB page. 4094 vmemmap pages of size 4KB each can be saved for each
1GB HugeTLB page.
When a HugeTLB page is allocated or freed, the vmemmap array representing
the range associated with the page will need to be remapped. When a page
is allocated, vmemmap pages are freed after remapping. When a page is
freed, previously discarded vmemmap pages must be allocated before
remapping.
The config option is introduced early so that supporting code can be
written to depend on the option. The initial version of the code only
provides support for x86-64.
If config HAVE_BOOTMEM_INFO_NODE is enabled, the freeing vmemmap page code
denpend on it to free vmemmap pages. Otherwise, just use
free_reserved_page() to free vmemmmap pages. The routine
register_page_bootmem_info() is used to register bootmem info. Therefore,
make sure register_page_bootmem_info is enabled if
HUGETLB_PAGE_FREE_VMEMMAP is defined.
Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4 in 5.14:
- Allow applications to poll on changes to /sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmDcjRgACgkQ8vlZVpUN
gaMAMQgAjRYUQ+tdJVZzInFwukudhgLyuCP9AdCx76fisaH22yNCakQ7M2XGz59i
/YbJerLaueYpHZzpA9p5+sSjVhMwILO3scBSJbOwdsbrFAsFLzcgQKQhGGqK2KvX
IAOEArC8/hm1wnVb7sfQYdBHlWyeJpI8hd/8WZPlYtySlRnP1TZCd+X7y7lmNs1H
QU1KECwstI2t8Lug0QeKx2B9PI9AWcCs0lTJ4LfcANZAh3HIJi9aUCk4SFDRkf3/
8AazvMqTHJD9yc+BNyZOro2ykDFCStkNqf0cDYTzvKrr66CHScPUtyI0oAEdspxN
+SNNARPGZgNOuR3ZRbGivtwgEB+GpQ==
=jSd4
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"In addition to bug fixes and cleanups, there are two new features for
ext4 in 5.14:
- Allow applications to poll on changes to
/sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
jbd2: export jbd2_journal_[un]register_shrinker()
ext4: notify sysfs on errors_count value change
fs: remove bdev_try_to_free_page callback
ext4: remove bdev_try_to_free_page() callback
jbd2: simplify journal_clean_one_cp_list()
jbd2,ext4: add a shrinker to release checkpointed buffers
jbd2: remove redundant buffer io error checks
jbd2: don't abort the journal when freeing buffers
jbd2: ensure abort the journal if detect IO error when writing original buffer back
jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
ext4: no need to verify new add extent block
jbd2: clean up misleading comments for jbd2_fc_release_bufs
ext4: add check to prevent attempting to resize an fs with sparse_super2
ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
ext4: fsmap: fix the block/inode bitmap comment
ext4: fix comment for s_hash_unsigned
ext4: use local variable ei instead of EXT4_I() macro
ext4: fix avefreec in find_group_orlov
ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
...
Core:
- BPF:
- add syscall program type and libbpf support for generating
instructions and bindings for in-kernel BPF loaders (BPF loaders
for BPF), this is a stepping stone for signed BPF programs
- infrastructure to migrate TCP child sockets from one listener
to another in the same reuseport group/map to improve flexibility
of service hand-off/restart
- add broadcast support to XDP redirect
- allow bypass of the lockless qdisc to improving performance
(for pktgen: +23% with one thread, +44% with 2 threads)
- add a simpler version of "DO_ONCE()" which does not require
jump labels, intended for slow-path usage
- virtio/vsock: introduce SOCK_SEQPACKET support
- add getsocketopt to retrieve netns cookie
- ip: treat lowest address of a IPv4 subnet as ordinary unicast address
allowing reclaiming of precious IPv4 addresses
- ipv6: use prandom_u32() for ID generation
- ip: add support for more flexible field selection for hashing
across multi-path routes (w/ offload to mlxsw)
- icmp: add support for extended RFC 8335 PROBE (ping)
- seg6: add support for SRv6 End.DT46 behavior
- mptcp:
- DSS checksum support (RFC 8684) to detect middlebox meddling
- support Connection-time 'C' flag
- time stamping support
- sctp: packetization Layer Path MTU Discovery (RFC 8899)
- xfrm: speed up state addition with seq set
- WiFi:
- hidden AP discovery on 6 GHz and other HE 6 GHz improvements
- aggregation handling improvements for some drivers
- minstrel improvements for no-ack frames
- deferred rate control for TXQs to improve reaction times
- switch from round robin to virtual time-based airtime scheduler
- add trace points:
- tcp checksum errors
- openvswitch - action execution, upcalls
- socket errors via sk_error_report
Device APIs:
- devlink: add rate API for hierarchical control of max egress rate
of virtual devices (VFs, SFs etc.)
- don't require RCU read lock to be held around BPF hooks
in NAPI context
- page_pool: generic buffer recycling
New hardware/drivers:
- mobile:
- iosm: PCIe Driver for Intel M.2 Modem
- support for Qualcomm MSM8998 (ipa)
- WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
- sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
- Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
- NXP SJA1110 Automotive Ethernet 10-port switch
- Qualcomm QCA8327 switch support (qca8k)
- Mikrotik 10/25G NIC (atl1c)
Driver changes:
- ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
(our first foray into MAC/PHY description via ACPI)
- HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
- Mellanox/Nvidia NIC (mlx5)
- NIC VF offload of L2 bridging
- support IRQ distribution to Sub-functions
- Marvell (prestera):
- add flower and match all
- devlink trap
- link aggregation
- Netronome (nfp): connection tracking offload
- Intel 1GE (igc): add AF_XDP support
- Marvell DPU (octeontx2): ingress ratelimit offload
- Google vNIC (gve): new ring/descriptor format support
- Qualcomm mobile (rmnet & ipa): inline checksum offload support
- MediaTek WiFi (mt76)
- mt7915 MSI support
- mt7915 Tx status reporting
- mt7915 thermal sensors support
- mt7921 decapsulation offload
- mt7921 enable runtime pm and deep sleep
- Realtek WiFi (rtw88)
- beacon filter support
- Tx antenna path diversity support
- firmware crash information via devcoredump
- Qualcomm 60GHz WiFi (wcn36xx)
- Wake-on-WLAN support with magic packets and GTK rekeying
- Micrel PHY (ksz886x/ksz8081): add cable test support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDb+fUACgkQMUZtbf5S
Irs2Jg//aqN0Q8CgIvYCVhPxQw1tY7pTAbgyqgBZ01vwjyvtIOgJiWzSfFEU84mX
M8fcpFX5eTKrOyJ9S6UFfQ/JG114n3hjAxFFT4Hxk2gC1Tg0vHuFQTDHcUl28bUE
mTm61e1YpdorILnv2k5JVQ/wu0vs5QKDrjcYcrcPnh+j93wvnPOgAfDBV95nZzjS
OTt4q2fR8GzLcSYWWsclMbDNkzyTG50RW/0Yd6aGjr5QGvXfrMeXfUJNz533PMf/
w5lNyjRKv+x9mdTZJzU0+msNUrZgUdRz7W8Ey8lD3hJZRE+D6/uU7FtsE8Mi3+uc
HWxeZUyzA3YF1MfVl/eesbxyPT7S/OkLzk4O5B35FbqP0YltaP+bOjq1/nM3ce1/
io9Dx9pIl/2JANUgRCAtLi8Z2dkvRoqTaBxZ/nPudCCljFwDwl6joTMJ7Ow22i5Y
5aIkcXFmZq4LbJDiHvbTlqT7yiuaEvu2UK/23bSIg/K3nF4eAmkY9Y1EgiMf60OF
78Ttw0wk2tUegwaS5MZnCniKBKDyl9gM2F6rbZ/IxQRR2LTXFc1B6gC+ynUxgXfh
Ub8O++6qGYGYZ0XvQH4pzco79p3qQWBTK5beIp2eu6BOAjBVIXq4AibUfoQLACsu
hX7jMPYd0kc3WFgUnKgQP8EnjFSwbf4XiaE7fIXvWBY8hzCw2h4=
=LvtX
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- BPF:
- add syscall program type and libbpf support for generating
instructions and bindings for in-kernel BPF loaders (BPF loaders
for BPF), this is a stepping stone for signed BPF programs
- infrastructure to migrate TCP child sockets from one listener to
another in the same reuseport group/map to improve flexibility
of service hand-off/restart
- add broadcast support to XDP redirect
- allow bypass of the lockless qdisc to improving performance (for
pktgen: +23% with one thread, +44% with 2 threads)
- add a simpler version of "DO_ONCE()" which does not require jump
labels, intended for slow-path usage
- virtio/vsock: introduce SOCK_SEQPACKET support
- add getsocketopt to retrieve netns cookie
- ip: treat lowest address of a IPv4 subnet as ordinary unicast
address allowing reclaiming of precious IPv4 addresses
- ipv6: use prandom_u32() for ID generation
- ip: add support for more flexible field selection for hashing
across multi-path routes (w/ offload to mlxsw)
- icmp: add support for extended RFC 8335 PROBE (ping)
- seg6: add support for SRv6 End.DT46 behavior
- mptcp:
- DSS checksum support (RFC 8684) to detect middlebox meddling
- support Connection-time 'C' flag
- time stamping support
- sctp: packetization Layer Path MTU Discovery (RFC 8899)
- xfrm: speed up state addition with seq set
- WiFi:
- hidden AP discovery on 6 GHz and other HE 6 GHz improvements
- aggregation handling improvements for some drivers
- minstrel improvements for no-ack frames
- deferred rate control for TXQs to improve reaction times
- switch from round robin to virtual time-based airtime scheduler
- add trace points:
- tcp checksum errors
- openvswitch - action execution, upcalls
- socket errors via sk_error_report
Device APIs:
- devlink: add rate API for hierarchical control of max egress rate
of virtual devices (VFs, SFs etc.)
- don't require RCU read lock to be held around BPF hooks in NAPI
context
- page_pool: generic buffer recycling
New hardware/drivers:
- mobile:
- iosm: PCIe Driver for Intel M.2 Modem
- support for Qualcomm MSM8998 (ipa)
- WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
- sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
- Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
- NXP SJA1110 Automotive Ethernet 10-port switch
- Qualcomm QCA8327 switch support (qca8k)
- Mikrotik 10/25G NIC (atl1c)
Driver changes:
- ACPI support for some MDIO, MAC and PHY devices from Marvell and
NXP (our first foray into MAC/PHY description via ACPI)
- HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
- Mellanox/Nvidia NIC (mlx5)
- NIC VF offload of L2 bridging
- support IRQ distribution to Sub-functions
- Marvell (prestera):
- add flower and match all
- devlink trap
- link aggregation
- Netronome (nfp): connection tracking offload
- Intel 1GE (igc): add AF_XDP support
- Marvell DPU (octeontx2): ingress ratelimit offload
- Google vNIC (gve): new ring/descriptor format support
- Qualcomm mobile (rmnet & ipa): inline checksum offload support
- MediaTek WiFi (mt76)
- mt7915 MSI support
- mt7915 Tx status reporting
- mt7915 thermal sensors support
- mt7921 decapsulation offload
- mt7921 enable runtime pm and deep sleep
- Realtek WiFi (rtw88)
- beacon filter support
- Tx antenna path diversity support
- firmware crash information via devcoredump
- Qualcomm WiFi (wcn36xx)
- Wake-on-WLAN support with magic packets and GTK rekeying
- Micrel PHY (ksz886x/ksz8081): add cable test support"
* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
tcp: change ICSK_CA_PRIV_SIZE definition
tcp_yeah: check struct yeah size at compile time
gve: DQO: Fix off by one in gve_rx_dqo()
stmmac: intel: set PCI_D3hot in suspend
stmmac: intel: Enable PHY WOL option in EHL
net: stmmac: option to enable PHY WOL with PMT enabled
net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
net: use netdev_info in ndo_dflt_fdb_{add,del}
ptp: Set lookup cookie when creating a PTP PPS source.
net: sock: add trace for socket errors
net: sock: introduce sk_error_report
net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
net: dsa: include fdb entries pointing to bridge in the host fdb list
net: dsa: include bridge addresses which are local in the host fdb list
net: dsa: sync static FDB entries on foreign interfaces to hardware
net: dsa: install the host MDB and FDB entries in the master's RX filter
net: dsa: reference count the FDB addresses at the cross-chip notifier level
net: dsa: introduce a separate cross-chip notifier type for host FDBs
net: dsa: reference count the MDB entries at the cross-chip notifier level
...
We currently spin in iopoll() when requests to be iopolled are for
same file(device), while one device may have multiple hardware queues.
given an example:
hw_queue_0 | hw_queue_1
req(30us) req(10us)
If we first spin on iopolling for the hw_queue_0. the avg latency would
be (30us + 30us) / 2 = 30us. While if we do round robin, the avg
latency would be (30us + 10us) / 2 = 20us since we reap the request in
hw_queue_1 in time. So it's better to do spinning only when requests
are in same hardware queue.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of requests are allocated from an internal cache, so it's waste of
time fully initialising them every time. Instead, let's pre-init some of
the fields we can during initial allocation (e.g. kmalloc(), see
io_alloc_req()) and keep them valid on request recycling. There are four
of them in this patch:
->ctx is always stays the same
->link is NULL on free, it's an invariant
->result is not even needed to init, just a precaution
->async_data we now clean in io_dismantle_req() as it's likely to
never be allocated.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/892ba0e71309bba9fe9e0142472330bbf9d8f05d.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since cancellation got moved before exit_signals(), there is no one left
who can call io_run_task_work() with PF_EXIING set, so remove the check.
Note that __io_req_task_submit() still needs a similar check.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7f305ececb1e6044ea649fb983ca754805bb884.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gcc 11 goes a weird path and duplicates most of io_arm_poll_handler()
for READ and WRITE cases. Help it and move all pollin vs pollout
specific bits under a single if-else, so there is no temptation for this
kind of unfolding.
before vs after:
text data bss dec hex filename
85362 12650 8 98020 17ee4 ./fs/io_uring.o
85186 12650 8 97844 17e34 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1deea0037293a922a0358e2958384b2e42437885.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is quite frequent that when an operation fails and returns EAGAIN,
the data becomes available between that failure and the call to
vfs_poll() done by io_arm_poll_handler().
Detecting the situation and reissuing the operation is much faster
than going ahead and push the operation to the io-wq.
Performance improvement testing has been performed with:
Single thread, 1 TCP connection receiving a 5 Mbps stream, no sqpoll.
4 measurements have been taken:
1. The time it takes to process a read request when data is already available
2. The time it takes to process by calling twice io_issue_sqe() after vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously
2.25% of all the read operations did use the new path.
ready data (baseline)
avg 3657.94182918628
min 580
max 20098
stddev 1213.15975908162
reissue completion
average 7882.67567567568
min 2316
max 28811
stddev 1982.79172973284
insert io-wq time
average 8983.82276995305
min 3324
max 87816
stddev 2551.60056552038
async time completion
average 24670.4758861127
min 10758
max 102612
stddev 3483.92416873804
Conclusion:
On average reissuing the sqe with the patch code is 1.1uSec faster and
in the worse case scenario 59uSec faster than placing the request on
io-wq
On average completion time by reissuing the sqe with the patch code is
16.79uSec faster and in the worse case scenario 73.8uSec faster than
async completion.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e8441419bb1b8f3c3fcc607b2713efecdef2136.1624364038.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.
Fixes: 14a1143b68 ("io_uring: add support for IORING_OP_UNLINKAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.
Fixes: 80a261fd00 ("io_uring: add support for IORING_OP_RENAMEAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>