Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where verifier went overly conservative rejecting
previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit a23740ec43 ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec43, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.
The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.
However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().
One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.
Note that the map->writecnt has been converted into a atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1 ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull input updates from Dmitry Torokhov:
"Just one new driver (Cypress StreetFighter touchkey), and no input
core changes this time.
Plus various fixes and enhancements to existing drivers"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: iforce - fix control-message timeout
Input: wacom_i2c - use macros for the bit masks
Input: ili210x - reduce sample period to 15ms
Input: ili210x - improve polled sample spacing
Input: ili210x - special case ili251x sample read out
Input: elantench - fix misreporting trackpoint coordinates
Input: synaptics-rmi4 - Fix device hierarchy
Input: i8042 - Add quirk for Fujitsu Lifebook T725
Input: cap11xx - add support for cap1206
Input: remove unused header <linux/input/cy8ctmg110_pdata.h>
Input: ili210x - add ili251x firmware update support
Input: ili210x - export ili251x version details via sysfs
Input: ili210x - use resolution from ili251x firmware
Input: pm8941-pwrkey - respect reboot_mode for warm reset
reboot: export symbol 'reboot_mode'
Input: max77693-haptic - drop unneeded MODULE_ALIAS
Input: cpcap-pwrbutton - do not set input parent explicitly
Input: max8925_onkey - don't mark comment as kernel-doc
Input: ads7846 - do not attempt IRQ workaround when deferring probe
Input: ads7846 - use input_set_capability()
...
Introduction of map_uid made two lookups from outer map to be distinct.
That distinction is only necessary when inner map has an embedded timer.
Otherwise it will make the verifier state pruning to be conservative
which will cause complex programs to hit 1M insn_processed limit.
Tighten map_uid logic to apply to inner maps with timers only.
Fixes: 3e8ce29850 ("bpf: Prevent pointer mismatch in bpf_timer_init.")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/CACAyw99hVEJFoiBH_ZGyy=+oO-jyydoz6v1DeKPKs2HVsUH28w@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211110172556.20754-1-alexei.starovoitov@gmail.com
This series contains initialization fixups, testing improvements, addition
of instruction pointer to data-race reports, and scoped data-race checks.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmGNQO4THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jIECD/49FaTsFhtZdEDlvLI2u2QJnxkVjwda
PBZkJrB66jDk0Dyc0oUxOu4GGSw64vze8HOJxWhaBA4tmqWGDA0DmTqRFQ3VJ4uW
Csl1uCzkIR9R0dgkDFwkvnq2fNbcr4SwDu0i+7Iig3zws7nhnZlSJPSze6gFkVX2
mLtUXybSR4FvlFMRePHd6cxltmwUohLKOklsI6emOfnSgBBFQ3584wEZ2HN5KwwO
8EwVxE5YNWyZQKqIj76tUoa8qkWbp5SoiiK6mzSQbJpgX8gLN3GngeAc9ZrfY09R
aiSQK9FnkcNkpnRROKA6Go6ze5NGa+1NvF32swZ1nSYOb/LFBDtwt4G8Y8cqdmLv
UtsxjFX4hhxdZzBSbGK3GwDWtDLWgHrmf5K/qPNHkM+QwdoyS27C5Kzfs4jkbtZ0
rAEWBxTrtdTCd+xMIz04ZDlio05CqSqme2/t4xaxGpcYGHLcuSi3uFa1cRvfaew8
rSfq2WKd9Cu2dKmjyF+EtN4Y2o8l8IaxJyeq5bVrBHeijIBH0KdCWkeDhWIJcMmE
Wo36PYsFLyCdAwr66IoNFHvOxbtAQsERZa0/2FGlOKBAzntNA72BdlAFgKJWiLKg
M5K1Q+r7kfns/T1JhftTByryZBd5JM+OiZ/rwU0hCRY48L93ftTzGYSyLVfPBeZ0
lDgc/oJQziv9fA==
=MjQ1
-----END PGP SIGNATURE-----
Merge tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"This contains initialization fixups, testing improvements, addition of
instruction pointer to data-race reports, and scoped data-race checks"
* tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: selftest: Cleanup and add missing __init
kcsan: Move ctx to start of argument list
kcsan: Support reporting scoped read-write access type
kcsan: Start stack trace with explicit location if provided
kcsan: Save instruction pointer for scoped accesses
kcsan: Add ability to pass instruction pointer of access to reporting
kcsan: test: Fix flaky test case
kcsan: test: Use kunit_skip() to skip tests
kcsan: test: Defer kcsan_test_init() after kunit initialization
and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different
from the tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit operations
to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
to access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
=Ttgq
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different from the
tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the
workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit
operations to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs"
* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
selftests/net: udpgso_bench_rx: fix port argument
net: wwan: iosm: fix compilation warning
cxgb4: fix eeprom len when diagnostics not implemented
net: fix premature exit from NAPI state polling in napi_disable()
net/smc: fix sk_refcnt underflow on linkdown and fallback
net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
gve: fix unmatched u64_stats_update_end()
net: ethernet: lantiq_etop: Fix compilation error
selftests: forwarding: Fix packet matching in mirroring selftests
vsock: prevent unnecessary refcnt inc for nonblocking connect
net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
net: stmmac: allow a tc-taprio base-time of zero
selftests: net: test_vxlan_under_vrf: fix HV connectivity test
net: hns3: allow configure ETS bandwidth of all TCs
net: hns3: remove check VF uc mac exist when set by PF
net: hns3: fix some mac statistics is always 0 in device version V2
net: hns3: fix kernel crash when unload VF while it is being reset
net: hns3: sync rx ring head in echo common pull
net: hns3: fix pfc packet number incorrect after querying pfc parameters
...
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYYvEbgAKCRCRxhvAZXjc
og17AQDj+gsxk2lT4GsRo+WrI9qegGSvYHaxbOoqqSL6rHrrsQD+IU92dwVfuUXE
oP+De6/TBmsdygnlECxITp8p4ByhGAM=
=wi2X
-----END PGP SIGNATURE-----
Merge tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull prctl updates from Christian Brauner:
"This contains the missing prctl uapi pieces for PR_SCHED_CORE.
In order to activate core scheduling the caller is expected to specify
the scope of the new core scheduling domain.
For example, passing 2 in the 4th argument of
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>, 2, 0);
would indicate that the new core scheduling domain encompasses all
tasks in the process group of <pid>. Specifying 0 would only create a
core scheduling domain for the thread identified by <pid> and 2 would
encompass the whole thread-group of <pid>.
Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID,
and PIDTYPE_PGID. A first version tried to expose those values
directly to which I objected because:
- PIDTYPE_* is an enum that is kernel internal which we should not
expose to userspace directly.
- PIDTYPE_* indicates what a given struct pid is used for it doesn't
express a scope.
But what the 4th argument of PR_SCHED_CORE prctl() expresses is the
scope of the operation, i.e. the scope of the core scheduling domain
at creation time. So Eugene's patch now simply introduces three new
defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP,
and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what
happens.
This has been on the mailing list for quite a while with all relevant
scheduler folks Cced. I announced multiple times that I'd pick this up
if I don't see or her anyone else doing it. None of this touches
proper scheduler code but only concerns uapi so I think this is fine.
With core scheduling being quite common now for vm managers (e.g.
moving individual vcpu threads into their own core scheduling domain)
and container managers (e.g. moving the init process into its own core
scheduling domain and letting all created children inherit it) having
to rely on raw numbers passed as the 4th argument in prctl() is a bit
annoying and everyone is starting to come up with their own defines"
* tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYYvE0wAKCRCRxhvAZXjc
oo36AQCQRC9+LsfBsfoqrrdfWqp9ifs9DuytUg+CTftsy1Pn0QD/ZtySkNx9mnNl
0/lSTN5dJBfEYm6Xcfxuu/vu/iauhw0=
=dY6T
-----END PGP SIGNATURE-----
Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"Various places in the kernel have picked up pidfds.
The two most recent additions have probably been the ability to use
pidfds in bpf maps and the usage of pidfds in mm-based syscalls such
as process_mrelease() and process_madvise().
The same pattern to turn a pidfd into a struct task exists in two
places. One of those places used PIDTYPE_TGID while the other one used
PIDTYPE_PID even though it is clearly documented in all pidfd-helpers
that pidfds __currently__ only refer to thread-group leaders (subject
to change in the future if need be).
This isn't a bug per se but has the potential to be one if we allow
pidfds to refer to individual threads. If that happens we want to
audit all codepaths that make use of them to ensure they can deal with
pidfds refering to individual threads.
This adds a simple helper to turn a pidfd into a struct task making it
easy to grep for such places. Plus, it gets rid of code-duplication"
* tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
mm: use pidfd_get_task()
pid: add pidfd_get_task() helper
We can't call unregister_ftrace_function under ftrace_lock.
Link: https://lkml.kernel.org/r/20211109114217.1645296-1-jolsa@kernel.org
Fixes: ed29271894 ("ftrace/direct: Do not disable when switching direct callers")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The resetting of the entire ring buffer use to simply go through and reset
each individual CPU buffer that had its own protection and synchronization.
But this was very slow, due to performing a synchronization for each CPU.
The code was reshuffled to do one disabling of all CPU buffers, followed
by a single RCU synchronization, and then the resetting of each of the CPU
buffers. But unfortunately, the mutex that prevented multiple occurrences
of resetting the buffer was not moved to the upper function, and there is
nothing to protect from it.
Take the ring buffer mutex around the global reset.
Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- convert sparc32 to the generic dma-direct code
- use bitmap_zalloc (Christophe JAILLET)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmGKfNYLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMEIRAAhOocEFpeaSg8iLMd7QLzm5vvzAuR43iykkKCvdvV
Q4P+g8H9Jr65ThsGS90AuuDKuyKh3tmbL7loHlyDygmRHhHALOO4127um4RAnOAL
1y2qCRwgHEZTu1uiu65cB+RRrlJP6T4sHV7+U3uZ3P5nfQoVVIoHKMceSTLIa3dx
WPyJXP33TWK50ZvGYuzMhO5hQPA8sKSePiaN3gz3anF0lMnqlUNh1Iso6nasUW40
XifOFM2Bg/SO7HpBGssrku6Zc5x9TpyuQtLP0u+LpjrbUYUZvz/OteyVu5cTZdbP
QG7MG6jcvDuU41sjKYNjaNpGZlvmXrEs4pXiwbOhzHTG8TFIEiR/LRsrvBGS7DJ8
y0NKNryIKR3+9fMKDH0PWHC7NszJbAQR0J7OT7+GP8cx9M62x5MuV8d2uOXp6TPY
v3VO0SJQrBZLKpY7vixZ6TOYMz15kmULMRrkGzf95+z5MpM2RjJ4lY8Kqlm2PBRR
Q3k53Ii8ya9U61SvgcCH39gR1fGT+WO8E5UFttCfhUhn49KJc7DqbEUiOC8Ta7QC
OONXxhGLdXAkti5NLFAexk8zdLBVRMnzfG44tBnP/JWDbQu3lMNuQfUXzsJK9yDb
zWr/832qwTIzT01NGZDFWdKUPNpafyuDQ1lP9rZZ2ZLo+f/EXNsHvczXvkwP08xS
cyY=
=DvuN
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
"Just a small set of changes this time. The request dma_direct_alloc
cleanups are still under review and haven't made the cut.
Summary:
- convert sparc32 to the generic dma-direct code
- use bitmap_zalloc (Christophe JAILLET)"
* tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: use 'bitmap_zalloc()' when applicable
sparc32: use DMA_DIRECT_REMAP
sparc32: remove dma_make_coherent
sparc32: remove the call to dma_make_coherent in arch_dma_free
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible.
On the one hand, the virtio-mem driver adds/removes whole memory blocks,
creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other
hand, it logically (un)plugs memory inside added memory blocks,
dynamically either exposing them to the buddy or hiding them from the
buddy and marking them PG_offline.
In contrast to physical devices, like a DIMM, the virtio-mem driver is
required to actually make use of any of the device-provided memory,
because it performs the handshake with the hypervisor. virtio-mem
memory cannot simply be access via /dev/mem without a driver.
There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
unplugged holes or might get silently unplugged by the virtio-mem
driver and consequently turned inaccessible.
b) Access unplugged memory blocks via /dev/mem because the virtio-mem
driver is required to make them actually accessible first.
The virtio-spec states that unplugged memory blocks MUST NOT be written,
and only selected unplugged memory blocks MAY be read. We want to make
sure, this is the case in sane environments -- where the virtio-mem driver
was loaded.
We want to make sure that in a sane environment, nobody "accidentially"
accesses unplugged memory inside the device managed region. For example,
a user might spot a memory region in /proc/iomem and try accessing it via
/dev/mem via gdb or dumping it via something else. By the time the mmap()
happens, the memory might already have been removed by the virtio-mem
driver silently: the mmap() would succeeed and user space might
accidentially access unplugged memory.
So once the driver was loaded and detected the device along the
device-managed region, we just want to disallow any access via /dev/mem to
it.
In an ideal world, we would mark the whole region as busy ("owned by a
driver") and exclude it; however, that would be wrong, as we don't really
have actual system RAM at these ranges added to Linux ("busy system RAM").
Instead, we want to mark such ranges as "not actual busy system RAM but
still soft-reserved and prepared by a driver for future use."
Let's teach iomem_is_exclusive() to reject access to any range with
"IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
if "iomem=relaxed" is set. Introduce EXCLUSIVE_SYSTEM_RAM to make it
easier for applicable drivers to depend on this setting in their Kconfig.
For now, there are no applicable ranges and we'll modify virtio-mem next
to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
creates to contain all actual busy system RAM added via
add_memory_driver_managed().
Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "virtio-mem: disallow mapping virtio-mem memory via /dev/mem", v5.
Let's add the basic infrastructure to exclude some physical memory regions
marked as "IORESOURCE_SYSTEM_RAM" completely from /dev/mem access, even
though they are not marked IORESOURCE_BUSY and even though "iomem=relaxed"
is set. Resource IORESOURCE_EXCLUSIVE for that purpose instead of adding
new flags to express something similar to "soft-busy" or "not busy yet,
but already prepared by a driver and not to be mapped by user space".
Use it for virtio-mem, to disallow mapping any virtio-mem memory via
/dev/mem to user space after the virtio-mem driver was loaded.
This patch (of 3):
We end up traversing subtrees of ranges we are not interested in; let's
optimize this case, skipping such subtrees, cleaning up the function a
bit.
For example, in the following configuration (/proc/iomem):
00000000-00000fff : Reserved
00001000-00057fff : System RAM
00058000-00058fff : Reserved
00059000-0009cfff : System RAM
0009d000-000fffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c3fff : PCI Bus 0000:00
000c4000-000c7fff : PCI Bus 0000:00
000c8000-000cbfff : PCI Bus 0000:00
000cc000-000cffff : PCI Bus 0000:00
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000e3fff : PCI Bus 0000:00
000e4000-000e7fff : PCI Bus 0000:00
000e8000-000ebfff : PCI Bus 0000:00
000ec000-000effff : PCI Bus 0000:00
000f0000-000fffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-3fffffff : System RAM
40000000-403fffff : Reserved
40000000-403fffff : pnp 00:00
40400000-80a79fff : System RAM
...
We don't have to look at any children of "0009d000-000fffff : Reserved"
if we can just skip these 15 items directly because the parent range is
not of interest.
Link: https://lkml.kernel.org/r/20210920142856.17758-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210920142856.17758-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kcov code mixes local_irq_save() and spin_lock() in
kcov_remote_{start|end}(). This creates a warning on PREEMPT_RT because
local_irq_save() disables interrupts and spin_lock_t is turned into a
sleeping lock which can not be acquired in a section with disabled
interrupts.
The kcov_remote_lock is used to synchronize the access to the hash-list
kcov_remote_map. The local_irq_save() block protects access to the
per-CPU data kcov_percpu_data.
There is no compelling reason to change the lock type to raw_spin_lock_t
to make it work with local_irq_save(). Changing it would require to
move memory allocation (in kcov_remote_add()) and deallocation outside
of the locked section.
Adding an unlimited amount of entries to the hashlist will increase the
IRQ-off time during lookup. It could be argued that this is debug code
and the latency does not matter. There is however no need to do so and
it would allow to use this facility in an RT enabled build.
Using a local_lock_t instead of local_irq_save() has the befit of adding
a protection scope within the source which makes it obvious what is
protected. On a !PREEMPT_RT && !LOCKDEP build the local_lock_irqsave()
maps directly to local_irq_save() so there is overhead at runtime.
Replace the local_irq_save() section with a local_lock_t.
Link: https://lkml.kernel.org/r/20210923164741.1859522-6-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-6-bigeasy@linutronix.de
Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcov_remote_start() may need to allocate memory in the in_task() case
(otherwise per-CPU memory has been pre-allocated) and therefore requires
enabled interrupts.
The interrupts are enabled before checking if the allocation is required
so if no allocation is required then the interrupts are needlessly enabled
and disabled again.
Enable interrupts only if memory allocation is performed.
Link: https://lkml.kernel.org/r/20210923164741.1859522-5-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During boot kcov allocates per-CPU memory which is used later if remote/
softirq processing is enabled.
Allocate the per-CPU memory on the CPU local node to avoid cross node
memory access.
Link: https://lkml.kernel.org/r/20210923164741.1859522-4-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20210830172627.267989-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use swap() instead of reimplementing it.
Link: https://lkml.kernel.org/r/20210909022046.8151-1-ran.xiaokai@zte.com.cn
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The core_kernel_text() should check the gate area, as it is part of kernel
text range, use is_kernel_text() in core_kernel_text().
Link: https://lkml.kernel.org/r/20210930071143.63410-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The is_kernel_inittext() and init_kernel_text() are with same
functionality, let's just keep is_kernel_inittext() and move it into
sections.h, then update all the callers.
Link: https://lkml.kernel.org/r/20210930071143.63410-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move core_kernel_data() into sections.h and rename it to
is_kernel_core_data(), also make it return bool value, then update all the
callers.
Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "sections: Unify kernel sections range check and use", v4.
There are three head files(kallsyms.h, kernel.h and sections.h) which
include the kernel sections range check, let's make some cleanup and unify
them.
1. cleanup arch specific text/data check and fix address boundary check
in kallsyms.h
2. make all the basic/core kernel range check function into sections.h
3. update all the callers, and use the helper in sections.h to simplify
the code
After this series, we have 5 APIs about kernel sections range check in
sections.h
* is_kernel_rodata() --- already in sections.h
* is_kernel_core_data() --- come from core_kernel_data() in kernel.h
* is_kernel_inittext() --- come from kernel.h and kallsyms.h
* __is_kernel_text() --- add new internal helper
* __is_kernel() --- add new internal helper
Note: For the last two helpers, people should not use directly, consider to
use corresponding function in kallsyms.h.
This patch (of 11):
Remove arch specific text and data check after commit 4ba66a9760 ("arch:
remove blackfin port"), no need arch-specific text/data check.
Link: https://lkml.kernel.org/r/20210930071143.63410-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210930071143.63410-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A single patch this cycle. We replace some open-coded routines to
classify task states with the scheduler's own function to do this.
Alongside the obvious benefits of removing funky code and aligning
more exactly with the scheduler's task classification, this also
fixes a long standing compiler warning by removing the open-coded
routines that generated the warning.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmGJA90ACgkQfOMlXTn3
iKFJcw/9FLy94FS+y/F8xpTU89d2j92f8q1mxS9g7ToDzDOiIPLyNazbX+4PsXVQ
FGgLpTqzNZX3+D5eAnOA/BNwWXtsvdxpsNnkY5ZCsVY5kZ0zsBYHe1O5CM2TbcMg
bOvJRQVI/FjydlrwqxIz9gAD7FmT/QvyecIbHZm/zFiCxdQwZy3rFwREd5ENsjoG
wumCCCH8Gh/afi9Pu3ZKHoZggNy/gmtSP3h3wmyoQneVFIJ4Vw5J61GFCvMPD+pN
wuAXWpuzWaND5IPTr4aZMKHNSaxqADQoEpNWxkgRh0cNL4NGBKsdLZMcqTTiyWww
TJSDtQKqocQB99eouwzQoA8SBsZwRvKRf/33QUXrWCAjl5YRK+9fSd8+dEf9Zd0o
A3sh99ecmHXknY6K2uO7NFjUPLSA/QeMGBzNx9lt7RoL+14tjqZkrAvWXooZzBY3
j39gwI1kSplmmCSoXwoW3AFVcCLJcGzE9qh0NUmZgt3kv8K1SUo3gxotKs8KwKj/
xVozOokmZV2ZuCTf8oIw7ntLwIFjiUaYBE7JY+c8mT8VWCbs5ztyOb11I35YYT0V
InXMDICLxZBD85eNOHyPC0fAud5emfboHl5GSUxo2hPgrRKuBmqElGtxG9CC8DLR
SItPjKfrYI1CJd4uoFX54nC3GmwLVSAq3xDwpYsN4A4lLbJJytc=
=vIfI
-----END PGP SIGNATURE-----
Merge tag 'kgdb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"A single patch this cycle.
We replace some open-coded routines to classify task states with the
scheduler's own function to do this. Alongside the obvious benefits of
removing funky code and aligning more exactly with the scheduler's
task classification, this also fixes a long standing compiler warning
by removing the open-coded routines that generated the warning"
* tag 'kgdb-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Adopt scheduler's task classification
As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan as she debugged some corner
case error with modules. The error messages are improved for
elf_validity_check(). While doing this work a corner case fix was
spotted on validate_section_offset() due to a possible overflow bug
on 64-bit. The impact of this fix is low given this just limits
module section headers placed within the 32-bit boundary, and we
obviously don't have insane module sizes. Even if a specially crafted
module is constructed later checks would invalidate the module right
away.
I've let this sit through 0-day testing since October 15th with no
issues found.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmGFrvcSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinhFAP/1BBXuM/vevC1IdZaEU4M8pg07NOpkZt
PYJc8CxWKTtEg5hrLJMqOexXGwvAg/nq28IFWvUKh3bGtEghPyrQu6+I4mXsjnjJ
t9/AO+BOYU14DJGDAYEuReNsaAcyeRooHLriuUaNvhhaN9q+v+FRyBWNphmA6Tz7
VkCtmCNMFJZlhd9Cu4jOZpJe6CIe9gZ0czYfRshAl/3ZRSQjYaddtbYf1Cs8Vwah
by4o2YyvctrRzeOj/Fy+kbqZw2St39nZ5fKYwijRn1ZwHRQo6NQqrlMeS8rI0LgG
1YwWgNWO1FjaPzyIFcAhk2bUF2TxEf5/eVpXn2qXHnmVZ55oBPP/O7Th0/5OK9gD
utOMbO1nqBLBXUyX/1dO/UT36XcrqtUP0Y9VgjIvj9n8Y82RGYmBScH/TOU1f7A7
sH56sW9/3YvIOe8AShBHJ7IKqZXU0inIGasFYwKKm2pAOLtajaC9Sr5fqVbuyfNF
J2+nXipVzjI0f9SGTqmE41jynFGln6nfd1pgOOiysg9ZqxieINB0J8l0OHe6fZz/
zU4TehXZHE9DApP8D+rVpP0ltwR2YWs2u0zRqHr/0GEWYH00JZu2ymDR13W7izSp
KiiveBxhwBpewgV5cyua8TDyeKhn3mEJFNmijlaq4yq1P2oKeWTQRDRZjwUP8EZY
s16oV+BW7Kp+
=Evek
-----END PGP SIGNATURE-----
Merge tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"As requested by Jessica I'm stepping in to help with modules
maintenance. This is my first pull request to you.
I've collected only two patches for modules for the 5.16-rc1 merge
window. These patches are from Shuah Khan as she debugged some corner
case error with modules. The error messages are improved for
elf_validity_check(). While doing this work a corner case fix was
spotted on validate_section_offset() due to a possible overflow bug on
64-bit. The impact of this fix is low given this just limits module
section headers placed within the 32-bit boundary, and we obviously
don't have insane module sizes. Even if a specially crafted module is
constructed later checks would invalidate the module right away.
I've let this sit through 0-day testing since October 15th with no
issues found"
* tag 'modules-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: change to print useful messages from elf_validity_check()
module: fix validate_section_offset() overflow bug on 64-bit
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmGFN6IACgkQnJ2qBz9k
QNkfYwgA1w5x/CsN2IMZdx6FTuZFgbOvQpBMTry8iuOPKK3UyIkZaUirTVLKR0cm
k3QbBR9/vTfQTNg5weuFJcbPZZaCXKEvlPGvDh+pumMbfTkMwL3FADweNBoZ3PzO
EiRrV45AbRgSMOzsfURzCz1T53Gd8fYM3pXxmNXG+bnE7+Ea+heKgor8/jFc4U3w
kAKZTfyCiheo7KxVhFGnkGI3ZhIbnbZne4seY/CE4qtv7/bmBE7bhGpmv8LT5FUn
h/JBDLjFU0fzJpplXE6n/VHXeGaUwb8adnYpzojWQ0lLYFrMIZFQ0KkDK6PNwmJF
MKWGqRxDkf54oeWuEAJ9t4/OorqM9A==
=ltE7
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Support for reporting filesystem errors through fanotify so that
system health monitoring daemons can watch for these and act instead
of scraping system logs"
* tag 'fsnotify_for_v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (34 commits)
samples: remove duplicate include in fs-monitor.c
samples: Fix warning in fsnotify sample
docs: Fix formatting of literal sections in fanotify docs
samples: Make fs-monitor depend on libc and headers
docs: Document the FAN_FS_ERROR event
samples: Add fs error monitoring example
ext4: Send notifications on error
fanotify: Allow users to request FAN_FS_ERROR events
fanotify: Emit generic error info for error event
fanotify: Report fid info for file related file system errors
fanotify: WARN_ON against too large file handles
fanotify: Add helpers to decide whether to report FID/DFID
fanotify: Wrap object_fh inline space in a creator macro
fanotify: Support merging of error events
fanotify: Support enqueueing of error events
fanotify: Pre-allocate pool of error events
fanotify: Reserve UAPI bits for FAN_FS_ERROR
fsnotify: Support FS_ERROR event type
fanotify: Require fid_mode for any non-fd event
fanotify: Encode empty file handle when no inode is provided
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmGFXBkUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vx6Tg/7BsGWm8f+uw/mr9lLm47q2mc4XyoO
7bR9KDp5NM84W/8ZOU7dqqqsnY0ddrSOLBRyhJJYMW3SwJd1y1ajTBsL1Ujqv+eN
z+JUFmhq4Laqm4k6Spc9CEJE+Ol5P6gGUtxLYo6PM2R0VxnSs/rDxctT5i7YOpCi
COJ+NVT/mc/by2loz1kLTSR9GgtBBgd+Y8UA33GFbHKssROw02L0OI3wffp81Oba
EhMGPoD+0FndAniDw+vaOSoO+YaBuTfbM92T/O00mND69Fj1PWgmNWZz7gAVgsXb
3RrNENUFxgw6CDt7LZWB8OyT04iXe0R2kJs+PA9gigFCGbypwbd/Nbz5M7e9HUTR
ray+1EpZib6+nIksQBL2mX8nmtyHMcLiM57TOEhq0+ECDO640MiRm8t0FIG/1E8v
3ZYd9w20o/NxlFNXHxxpZ3D/osGH5ocyF5c5m1rfB4RGRwztZGL172LWCB0Ezz9r
eHB8sWxylxuhrH+hp2BzQjyddg7rbF+RA4AVfcQSxUpyV01hoRocKqknoDATVeLH
664nJIINFxKJFwfuL3E6OhrInNe1LnAhCZsHHqbS+NNQFgvPRznbixBeLkI9dMf5
Yf6vpsWO7ur8lHHbRndZubVu8nxklXTU7B/w+C11sq6k9LLRJSHzanr3Fn9WA80x
sznCxwUvbTCu1r0=
=nsMh
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Conserve IRQs by setting up portdrv IRQs only when there are users
(Jan Kiszka)
- Rework and simplify _OSC negotiation for control of PCIe features
(Joerg Roedel)
- Remove struct pci_dev.driver pointer since it's redundant with the
struct device.driver pointer (Uwe Kleine-König)
Resource management:
- Coalesce contiguous host bridge apertures from _CRS to accommodate
BARs that cover more than one aperture (Kai-Heng Feng)
Sysfs:
- Check CAP_SYS_ADMIN before parsing user input (Krzysztof
Wilczyński)
- Return -EINVAL consistently from "store" functions (Krzysztof
Wilczyński)
- Use sysfs_emit() in endpoint "show" functions to avoid buffer
overruns (Kunihiko Hayashi)
PCIe native device hotplug:
- Ignore Link Down/Up caused by resets during error recovery so
endpoint drivers can remain bound to the device (Lukas Wunner)
Virtualization:
- Avoid bus resets on Atheros QCA6174, where they hang the device
(Ingmar Klein)
- Work around Pericom PI7C9X2G switch packet drop erratum by using
store and forward mode instead of cut-through (Nathan Rossi)
- Avoid trying to enable AtomicOps on VFs; the PF setting applies to
all VFs (Selvin Xavier)
MSI:
- Document that /sys/bus/pci/devices/.../irq contains the legacy INTx
interrupt or the IRQ of the first MSI (not MSI-X) vector (Barry
Song)
VPD:
- Add pci_read_vpd_any() and pci_write_vpd_any() to access anywhere
in the possible VPD space; use these to simplify the cxgb3 driver
(Heiner Kallweit)
Peer-to-peer DMA:
- Add (not subtract) the bus offset when calculating DMA address
(Wang Lu)
ASPM:
- Re-enable LTR at Downstream Ports so they don't report Unsupported
Requests when reset or hot-added devices send LTR messages
(Mingchuang Qiao)
Apple PCIe controller driver:
- Add driver for Apple M1 PCIe controller (Alyssa Rosenzweig, Marc
Zyngier)
Cadence PCIe controller driver:
- Return success when probe succeeds instead of falling into error
path (Li Chen)
HiSilicon Kirin PCIe controller driver:
- Reorganize PHY logic and add support for external PHY drivers
(Mauro Carvalho Chehab)
- Support PERST# GPIOs for HiKey970 external PEX 8606 bridge (Mauro
Carvalho Chehab)
- Add Kirin 970 support (Mauro Carvalho Chehab)
- Make driver removable (Mauro Carvalho Chehab)
Intel VMD host bridge driver:
- If IOMMU supports interrupt remapping, leave VMD MSI-X remapping
enabled (Adrian Huang)
- Number each controller so we can tell them apart in
/proc/interrupts (Chunguang Xu)
- Avoid building on UML because VMD depends on x86 bare metal APIs
(Johannes Berg)
Marvell Aardvark PCIe controller driver:
- Define macros for PCI_EXP_DEVCTL_PAYLOAD_* (Pali Rohár)
- Set Max Payload Size to 512 bytes per Marvell spec (Pali Rohár)
- Downgrade PIO Response Status messages to debug level (Marek Behún)
- Preserve CRS SV (Config Request Retry Software Visibility) bit in
emulated Root Control register (Pali Rohár)
- Fix issue in configuring reference clock (Pali Rohár)
- Don't clear status bits for masked interrupts (Pali Rohár)
- Don't mask unused interrupts (Pali Rohár)
- Avoid code repetition in advk_pcie_rd_conf() (Marek Behún)
- Retry config accesses on CRS response (Pali Rohár)
- Simplify emulated Root Capabilities initialization (Pali Rohár)
- Fix several link training issues (Pali Rohár)
- Fix link-up checking via LTSSM (Pali Rohár)
- Fix reporting of Data Link Layer Link Active (Pali Rohár)
- Fix emulation of W1C bits (Marek Behún)
- Fix MSI domain .alloc() method to return zero on success (Marek
Behún)
- Read entire 16-bit MSI vector in MSI handler, not just low 8 bits
(Marek Behún)
- Clear Root Port I/O Space, Memory Space, and Bus Master Enable bits
at startup; PCI core will set those as necessary (Pali Rohár)
- When operating as a Root Port, set class code to "PCI Bridge"
instead of the default "Mass Storage Controller" (Pali Rohár)
- Add emulation for PCI_BRIDGE_CTL_BUS_RESET since aardvark doesn't
implement this per spec (Pali Rohár)
- Add emulation of option ROM BAR since aardvark doesn't implement
this per spec (Pali Rohár)
MediaTek MT7621 PCIe controller driver:
- Add MediaTek MT7621 PCIe host controller driver and DT binding
(Sergio Paracuellos)
Qualcomm PCIe controller driver:
- Add SC8180x compatible string (Bjorn Andersson)
- Add endpoint controller driver and DT binding (Manivannan
Sadhasivam)
- Restructure to use of_device_get_match_data() (Prasad Malisetty)
- Add SC7280-specific pcie_1_pipe_clk_src handling (Prasad Malisetty)
Renesas R-Car PCIe controller driver:
- Remove unnecessary includes (Geert Uytterhoeven)
Rockchip DesignWare PCIe controller driver:
- Add DT binding (Simon Xue)
Socionext UniPhier Pro5 controller driver:
- Serialize INTx masking/unmasking (Kunihiko Hayashi)
Synopsys DesignWare PCIe controller driver:
- Run dwc .host_init() method before registering MSI interrupt
handler so we can deal with pending interrupts left by bootloader
(Bjorn Andersson)
- Clean up Kconfig dependencies (Andy Shevchenko)
- Export symbols to allow more modular drivers (Luca Ceresoli)
TI DRA7xx PCIe controller driver:
- Allow host and endpoint drivers to be modules (Luca Ceresoli)
- Enable external clock if present (Luca Ceresoli)
TI J721E PCIe driver:
- Disable PHY when probe fails after initializing it (Christophe
JAILLET)
MicroSemi Switchtec management driver:
- Return error to application when command execution fails because an
out-of-band reset has cleared the device BARs, Memory Space Enable,
etc (Kelvin Cao)
- Fix MRPC error status handling issue (Kelvin Cao)
- Mask out other bits when reading of management VEP instance ID
(Kelvin Cao)
- Return EOPNOTSUPP instead of ENOTSUPP from sysfs show functions
(Kelvin Cao)
- Add check of event support (Logan Gunthorpe)
Miscellaneous:
- Remove unused pci_pool wrappers, which have been replaced by
dma_pool (Cai Huoqing)
- Use 'unsigned int' instead of bare 'unsigned' (Krzysztof
Wilczyński)
- Use kstrtobool() directly, sans strtobool() wrapper (Krzysztof
Wilczyński)
- Fix some sscanf(), sprintf() format mismatches (Krzysztof
Wilczyński)
- Update PCI subsystem information in MAINTAINERS (Krzysztof
Wilczyński)
- Correct some misspellings (Krzysztof Wilczyński)"
* tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (137 commits)
PCI: Add ACS quirk for Pericom PI7C9X2G switches
PCI: apple: Configure RID to SID mapper on device addition
iommu/dart: Exclude MSI doorbell from PCIe device IOVA range
PCI: apple: Implement MSI support
PCI: apple: Add INTx and per-port interrupt support
PCI: kirin: Allow removing the driver
PCI: kirin: De-init the dwc driver
PCI: kirin: Disable clkreq during poweroff sequence
PCI: kirin: Move the power-off code to a common routine
PCI: kirin: Add power_off support for Kirin 960 PHY
PCI: kirin: Allow building it as a module
PCI: kirin: Add MODULE_* macros
PCI: kirin: Add Kirin 970 compatible
PCI: kirin: Support PERST# GPIOs for HiKey970 external PEX 8606 bridge
PCI: apple: Set up reference clocks when probing
PCI: apple: Add initial hardware bring-up
PCI: of: Allow matching of an interrupt-map local to a PCI device
of/irq: Allow matching of an interrupt-map local to an interrupt controller
irqdomain: Make of_phandle_args_to_fwspec() generally available
PCI: Do not enable AtomicOps on VFs
...
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
filter_irq_stacks() has little to do with the stackdepot implementation,
except that it is usually used by users (such as KASAN) of stackdepot to
reduce the stack trace.
However, filter_irq_stacks() itself is not useful without a stack trace
as obtained by stack_trace_save() and friends.
Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new
users of filter_irq_stacks() do not have to start depending on
STACKDEPOT only for filter_irq_stacks().
Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
indicating that we're dealing with a memory region that is never
indicated in the firmware-provided memory map, but always detected and
added by a driver.
Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
memory regions like ordinary MEMBLOCK_NONE memory regions -- for
example, when selecting memory regions to add to the vmcore for dumping
in the crashkernel via for_each_mem_range().
However, especially kexec_file is not supposed to select such memblocks
via for_each_free_mem_range() / for_each_free_mem_range_reverse() to
place kexec images, similar to how we handle
IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK.
We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
support.
Note that kexec *must not* indicate this memory to the second kernel and
*must not* place kexec-images on this memory. Let's add a comment to
kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
locate_mem_hole_callback() for kexec_walk_resources().
Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
semantics:
MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during
boot; kexec *has to* indicate this memory to the second kernel and
can place kexec-images on this memory. After memory hotunplug,
kexec has to be re-armed. We mostly ignore this flag when
"movable_node" is not set on the kernel command line, because
then we're told to not care about hotunpluggability of such
memory regions.
MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
the firmware-provided memory map; this memory is always detected
and added to the system by a driver; memory might not actually be
physically hotunpluggable. kexec *must not* indicate this memory to
the second kernel and *must not* place kexec-images on this memory.
Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer so that memblock_free() will be a
counterpart of memblock_alloc()
The callers are updated with the below semantic patch and manual
addition of (void *) casting to pointers that are represented by
unsigned long variables.
@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)
[sfr@canb.auug.org.au: fixup]
Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since memblock_free() operates on a physical range, make its name
reflect it and rename it to memblock_phys_free(), so it will be a
logical counterpart to memblock_phys_alloc().
The callers are updated with the below semantic patch:
@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);
Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memblock_free_early_nid() is unused and memblock_free_early() is an
alias for memblock_free().
Replace calls to memblock_free_early() with calls to memblock_free() and
remove memblock_free_early() and memblock_free_early_nid().
Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7a5da02de8 ("locking/lockdep: check for freed initmem in
static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
report whether an object is part of already freed init memory.
For the time being, the generic version of
arch_is_kernel_initmem_freed() always reports 'false', allthough
free_initmem() is generically called on all architectures.
Therefore, change the generic version of arch_is_kernel_initmem_freed()
to check whether free_initmem() has been called. If so, then check if a
given address falls into init memory.
To ease the use of system_state, move it out of line into its only
caller which is lockdep.c
Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
core_kernel_text() considers that until system_state in at least
SYSTEM_RUNNING, init memory is valid.
But init memory is freed a few lines before setting SYSTEM_RUNNING, so
we have a small period of time when core_kernel_text() is wrong.
Create an intermediate system state called SYSTEM_FREEING_INIT that is
set before starting freeing init memory, and use it in
core_kernel_text() to report init memory invalid earlier.
Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.
It can be reproduced with command:
$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
(where node 4 is a movable node)
runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
Call Trace:
dump_stack+0x6b/0x88
dump_header+0x4a/0x1e2
oom_kill_process.cold+0xb/0x10
out_of_memory.part.0+0xaf/0x230
out_of_memory+0x3d/0x80
__alloc_pages_slowpath.constprop.0+0x954/0xa20
__alloc_pages_nodemask+0x2d3/0x300
pipe_write+0x322/0x590
new_sync_write+0x196/0x1b0
vfs_write+0x1c3/0x1f0
ksys_write+0xa7/0xe0
do_syscall_64+0x52/0xd0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mem-Info:
active_anon:392832 inactive_anon:182 isolated_anon:0
active_file:68130 inactive_file:151527 isolated_file:0
unevictable:2701 dirty:0 writeback:7
slab_reclaimable:51418 slab_unreclaimable:116300
mapped:45825 shmem:735 pagetables:2540 bounce:0
free:159849484 free_pcp:73 free_cma:0
Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The reason is that in this case, the target cpuset nodes only have
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal) like
GFP_HIGHUSER, and the cpuset limit forbids the allocation, then
out-of-memory killing is involved even when normal nodes and movable
nodes both have many free memory.
The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the caller
to deal with it.
So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.
As page allocation is one of the hottest path in kernel, this check will
hurt all users with sane cpuset configuration, add a static branch check
and detect the abnormal config in cpuset memory binding setup so that
the extra check cost in page allocation is not paid by everyone.
[thanks to Micho Hocko and David Rientjes for suggesting not handling
it inside OOM code, adding cpuset check, refining comments]
Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix NUMA without SMP".
SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).
This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.
This patch (of 2):
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.
Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes: a55c7454a8 ("sched/topology: Improve load balancing on AMD EPYC systems")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable mm->total_vm could be accessed concurrently during mmaping
and system accounting as noticed by KCSAN,
BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
mmap_region+0x6dc/0x1400
do_mmap+0x794/0xca0
vm_mmap_pgoff+0xdf/0x150
ksys_mmap_pgoff+0xe1/0x380
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
__acct_update_integrals+0x187/0x1d0
acct_account_cputime+0x3c/0x40
update_process_times+0x5c/0x150
tick_sched_timer+0x184/0x210
__run_hrtimer+0x119/0x3b0
hrtimer_interrupt+0x350/0xaa0
__sysvec_apic_timer_interrupt+0x7b/0x220
asm_call_irq_on_stack+0x12/0x20
sysvec_apic_timer_interrupt+0x4d/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
smp_call_function_single+0x192/0x2b0
perf_install_in_context+0x29b/0x4a0
__se_sys_perf_event_open+0x1a98/0x2550
__x64_sys_perf_event_open+0x63/0x70
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
In vm_stat_account which called by mmap_region, increase total_vm, and
__acct_update_integrals may read total_vm at the same time. This will
cause a data race which lead to undefined behaviour. To avoid potential
bad read/write, volatile property and barrier are both used to avoid
undefined behaviour.
Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
nor GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and commmit ab00db216c).
Fix it by instructing stackdepot to not expand stack storage via
alloc_pages() in case it runs out by using
kasan_record_aux_stack_noalloc().
While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes
extremely unlikely that kasan_record_aux_stack_noalloc() fails.
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is to fix an out-of-bound access issue when jit-ing the
bpf_pseudo_func insn (i.e. ld_imm64 with src_reg == BPF_PSEUDO_FUNC)
In jit_subprog(), it currently reuses the subprog index cached in
insn[1].imm. This subprog index is an index into a few array related
to subprogs. For example, in jit_subprog(), it is an index to the newly
allocated 'struct bpf_prog **func' array.
The subprog index was cached in insn[1].imm after add_subprog(). However,
this could become outdated (and too big in this case) if some subprogs
are completely removed during dead code elimination (in
adjust_subprog_starts_after_remove). The cached index in insn[1].imm
is not updated accordingly and causing out-of-bound issue in the later
jit_subprog().
Unlike bpf_pseudo_'func' insn, the current bpf_pseudo_'call' insn
is handling the DCE properly by calling find_subprog(insn->imm) to
figure out the index instead of caching the subprog index.
The existing bpf_adj_branches() will adjust the insn->imm
whenever insn is added or removed.
Instead of having two ways handling subprog index,
this patch is to make bpf_pseudo_func works more like
bpf_pseudo_call.
First change is to stop caching the subprog index result
in insn[1].imm after add_subprog(). The verification
process will use find_subprog(insn->imm) to figure
out the subprog index.
Second change is in bpf_adj_branches() and have it to
adjust the insn->imm for the bpf_pseudo_func insn also
whenever insn is added or removed.
Third change is in jit_subprog(). Like the bpf_pseudo_call handling,
bpf_pseudo_func temporarily stores the find_subprog() result
in insn->off. It is fine because the prog's insn has been finalized
at this point. insn->off will be reset back to 0 later to avoid
confusing the userspace prog dump tool.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211106014014.651018-1-kafai@fb.com
elf_validity_check() checks ELF headers for errors and ELF Spec.
compliance and if any of them fail it returns -ENOEXEC from all of
these error paths. Almost all of them don't print any messages.
When elf_validity_check() returns an error, load_module() prints an
error message without error code. It is hard to determine why the
module ELF structure is invalid, even if load_module() prints the
error code which is -ENOEXEC in all of these cases.
Change to print useful error messages from elf_validity_check() to
clearly say what went wrong and why the ELF validity checks failed.
Remove the load_module() error message which is no longer needed.
This patch includes changes to fix build warns on 32-bit platforms:
warning: format '%llu' expects argument of type 'long long unsigned int',
but argument 3 has type 'Elf32_Off' {aka 'unsigned int'}
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
validate_section_offset() uses unsigned long local variable to
add/store shdr->sh_offset and shdr->sh_size on all platforms.
unsigned long is too short when sh_offset is Elf64_Off which
would be the case on 64bit ELF headers.
Without this fix applied we were shorting the design of modules
to have section headers placed within the 32-bit boundary (4 GiB)
instead of 64-bits when on 64-bit architectures (which allows for
up to 16,777,216 TiB). In practice this just meant we were limiting
modules sections to below 4 GiB even on 64-bit systems. This then
should not really affect any real-world use case as modules these
days obviously should likely never exceed 1 GiB in size overall.
A specially crafted invalid module might succeed to skip validation
in validate_section_offset() due to this mistake, but in such case
no impact is observed through code inspection given the correct data
types are used for the copy of the module when needed on move_module()
when the section type is not SHT_NOBITS (which indicates no the
section occupies no space on the file).
Fix the overflow problem using the right size local variable when
CONFIG_64BIT is defined.
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
[mcgrof: expand commit log with possible impact if not applied]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Add support for the per-port interrupt controller that deals with both INTx
signalling and management interrupts.
This allows the Link-up/Link-down interrupts to be wired, allowing the
bring-up to be synchronised (and provide debug information). The framework
can further be used to handle the rest of the per port events if and when
necessary.
Likewise, INTx signalling is implemented so that end-points can actually be
used.
Link: https://lore.kernel.org/r/20210929163847.2807812-7-maz@kernel.org
Link: https://lore.kernel.org/r/20211004150552.3844830-1-maz@kernel.org
Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
of_phandle_args_to_fwspec() can be generally useful to code extracting a DT
of_phandle and using an irq_fwspec to use the hierarchical irqdomain API.
Make it visible to the rest of the kernel, including modules.
Link: https://lore.kernel.org/r/20210929163847.2807812-2-maz@kernel.org
Tested-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Including:
- Intel IOMMU Updates fro Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible string for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implemenation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmGD6NQACgkQK/BELZcB
GuOSfg/9FKXl5ym86BP3tAS1fREKH7p59JRGZrrIR89NyHAcEUjtNG3YLPao+YxU
3CDgLkru+vlDpYY54QoyqcY5FgIHT3Cna/Cdk4zekRmSO/14gHp47jtZRheOUzLF
rvwfaplcbbtT8akpsVFzvw8YpQLGSDiDQSl7xL2+40Z9hiYX/gS9Af+PH98tAXsa
yZKZj6gU+JXM58VihO3M7umyE06tovyBaYgcsBZtbf66bGc0ySu+fe75UVWbueRt
Z8jwqa7TUfVXiYC8h+LqtGET6gtzNSsxAU3VllRe7Brf6K8i/yaRs/TO2Hp83d7/
q/fcK3vNQ5v3aDNci/DjBB8SEySzCmRz/9ocCOCx8ByuRp+5lwVRPPq3WcUMtsZY
QpYo9Fk7luFz2Gj5LObKAVBvOoeBZ5Km3oPs4HVmQ6epxn/rVckJDnJnVSLJuATq
tSZC2heRfFlg1dT6WFaynCTP2RI1LlNEdKhHirV6L368rSjmF0ZdQxdTpHULsHr1
yMjqL21OfcSkLW91rvfb3g68EsIwDbCPGTOlQWZLmAtwOWtHSCLPgwwEG7WefZbH
yaslpmlUTOurUnFmpxlfLicy5sqsBL2ASzGJkEKrgunw82Ke96zzkRzi+9j9HeS6
g0AyIWMi1cUAjONVUZtV4yjImXh63HIPiKx730a9teodusoxm+Q=
=waUR
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Intel IOMMU Updates fro Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible string for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implemenation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
* tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (35 commits)
iommu/dma: Fix incorrect error return on iommu deferred attach
iommu/dart: Initialize DART_STREAMS_ENABLE
iommu/dma: Use kvcalloc() instead of kvzalloc()
iommu/tegra-smmu: Use devm_bitmap_zalloc when applicable
iommu/dart: Use kmemdup instead of kzalloc and memcpy
iommu/vt-d: Avoid duplicate removing in __domain_mapping()
iommu/vt-d: Convert the return type of first_pte_in_page to bool
iommu/vt-d: Clean up unused PASID updating functions
iommu/vt-d: Delete dev_has_feat callback
iommu/vt-d: Use second level for GPA->HPA translation
iommu/vt-d: Check FL and SL capability sanity in scalable mode
iommu/vt-d: Remove duplicate identity domain flag
iommu/vt-d: Dump DMAR translation structure when DMA fault occurs
iommu/vt-d: Do not falsely log intel_iommu is unsupported kernel option
iommu/arm-smmu-qcom: Request direct mapping for modem device
iommu: arm-smmu-qcom: Add compatible for QCM2290
dt-bindings: arm-smmu: Add compatible for QCM2290 SoC
iommu/arm-smmu-qcom: Add SM6350 SMMU compatible
dt-bindings: arm-smmu: Add compatible for SM6350 SoC
iommu/arm-smmu-v3: Properly handle the return value of arm_smmu_cmdq_build_cmd()
...
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling
code, and the ptrace code making coredumps much more complicated than
necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop. Then
cleans up the exec interactions with coredumps. Then cleans up the
coredump interactions with exit. Finally the coredump interactions
with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process
the called vfork if exec fails is sufficient justification to change
the userspace visible behavior.
In previous discussions some of these changes were organized
differently and individually appeared to make the code base worse. As
currently written I believe they all stand on their own as cleanups
and bug fixes.
Which means that even if the worst should happen and the last change
needs to be reverted for some unimaginable reason, the code base will
still be improved.
If the worst does not happen there are a more cleanups that can be
made. Signals that generate coredumps can easily become eligible for
short circuit delivery in complete_signal. The entire rendezvous for
generating a coredump can move into get_signal. The function
force_sig_info_to_task be written in a way that does not modify the
signal handling state of the target task (because coredumps are
eligible for short circuit delivery). Many of these future cleanups
can be done another way but nothing so cleanly as if coredumps become
per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
coredump: Limit coredumps to a single thread group
coredump: Don't perform any cleanups before dumping core
exit: Factor coredump_exit_mm out of exit_mm
exec: Check for a pending fatal signal instead of core_state
ptrace: Remove the unnecessary arguments from arch_ptrace_stop
signal: Remove the bogus sigkill_pending in ptrace_stop
As Andy pointed out that there are races between
force_sig_info_to_task and sigaction[1] when force_sig_info_task. As
Kees discovered[2] ptrace is also able to change these signals.
In the case of seeccomp killing a process with a signal it is a
security violation to allow the signal to be caught or manipulated.
Solve this problem by introducing a new flag SA_IMMUTABLE that
prevents sigaction and ptrace from modifying these forced signals.
This flag is carefully made kernel internal so that no new ABI is
introduced.
Longer term I think this can be solved by guaranteeing short circuit
delivery of signals in this case. Unfortunately reliable and
guaranteed short circuit delivery of these signals is still a ways off
from being implemented, tested, and merged. So I have implemented a much
simpler alternative for now.
[1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com
[2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook
Cc: stable@vger.kernel.org
Fixes: 307d522f5e ("signal/seccomp: Refactor seccomp signal and coredump generation")
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently kdb contains some open-coded routines to generate a summary
character for each task. This code currently issues warnings, is
almost certainly broken and won't make sense to any kernel dev who
has ever used /proc to examine task states.
Fix both the warning and the potential for confusion by adopting the
scheduler's task classification. Whilst doing this we also simplify the
filtering by using mask strings directly (which means we don't have to
guess all the characters the scheduler might give us).
Unfortunately we can't quite match the scheduler classification completely.
We add four extra states: - for idle loops and i, m and s for sleeping
system daemons (which means kthreads in one of the I, M and S states).
These extra states are used to manage the filters for tools to make the
output of ps and bta less noisy.
Note: The Fixes below is the last point the original dubious code was
moved; it was not introduced by that patch. However it gives us
the last point to which this patch can be easily backported.
Happily that should be enough to cover the introduction of
CONFIG_WERROR!
Fixes: 2f064a59a1 ("sched: Change task_struct::state")
Link: https://lore.kernel.org/r/20211102173158.3315227-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>