In an upcoming change we would like to add a flag to
GFP_HIGHUSER_MOVABLE so that it would no longer be an OR
of GFP_HIGHUSER and __GFP_MOVABLE. This poses a problem for
alloc_zeroed_user_highpage_movable() which passes __GFP_MOVABLE
into an arch-specific __alloc_zeroed_user_highpage() hook which ORs
in GFP_HIGHUSER.
Since __alloc_zeroed_user_highpage() is only ever called from
alloc_zeroed_user_highpage_movable(), we can remove one level
of indirection here. Remove __alloc_zeroed_user_highpage(),
make alloc_zeroed_user_highpage_movable() the hook, and use
GFP_HIGHUSER_MOVABLE in the hook implementations so that they will
pick up the new flag that we are going to add.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Ic6361c657b2cdcd896adbe0cf7cb5a7fbb1ed7bf
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210602235230.3928842-2-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Kprobes has a counter 'nmissed', that is used to count the number of
times a probe handler was not called. This generally happens when we hit
a kprobe while handling another kprobe.
However, if one of the probe handlers causes a fault, we are currently
incrementing 'nmissed'. The comment in fault handler indicates that this
can be used to account faults taken by the probe handlers. But, this has
never been the intention as is evident from the comment above 'nmissed'
in 'struct kprobe':
/*count the number of times this probe was temporarily disarmed */
unsigned long nmissed;
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210601120150.672652-1-naveen.n.rao@linux.vnet.ibm.com
The reason for kprobe::fault_handler(), as given by their comment:
* We come here because instructions in the pre/post
* handler caused the page_fault, this could happen
* if handler tries to access user space by
* copy_from_user(), get_user() etc. Let the
* user-specified handler try to fix it first.
Is just plain bad. Those other handlers are ran from non-preemptible
context and had better use _nofault() functions. Also, there is no
upstream usage of this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210525073213.561116662@infradead.org
arch/$(SRCARCH)/Kbuild is useful for Makefile cleanups because you can
use the obj-y syntax.
Add an empty file if it is missing in arch/$(SRCARCH)/.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Now that all architectures implement ARCH_ATOMIC, we can make it
mandatory, removing the Kconfig symbol and logic for !ARCH_ATOMIC.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210525140232.53872-33-mark.rutland@arm.com
Subsequent patches will move architectures over to the ARCH_ATOMIC API,
after preparing the asm-generic atomic implementations to function with
or without ARCH_ATOMIC.
As some architectures use the asm-generic implementations exclusively
(and don't have a local atomic.h), and to avoid the risk that
ARCH_ATOMIC isn't defined in some cases we expect, let's make the
ARCH_ATOMIC macro a Kconfig symbol instead, so that we can guarantee it
is consistently available where needed.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210525140232.53872-2-mark.rutland@arm.com
In commit fa8b90070a ("quota: wire up quotactl_path") we have wired up
new quotactl_path syscall. However some people in LWN discussion have
objected that the path based syscall is missing dirfd and flags argument
which is mostly standard for contemporary path based syscalls. Indeed
they have a point and after a discussion with Christian Brauner and
Sascha Hauer I've decided to disable the syscall for now and update its
API. Since there is no userspace currently using that syscall and it
hasn't been released in any major release, we should be fine.
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Sascha Hauer <s.hauer@pengutronix.de>
Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein
Signed-off-by: Jan Kara <jack@suse.cz>
As pointed out by commit
de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")
init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.
As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().
Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().
Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().
Secondary startups were patched via coccinelle:
@begone@
@@
-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
- add support for system call stack randomization.
- handle stale PCI deconfiguration events.
- couple of defconfig updates.
- some fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmCURRgACgkQIg7DeRsp
bsLiVw//ThqXjgP7koJtawL0MFvSo1V69KTw1QNoMmUvrCynZ8nJlt4sHj1LIuEN
m7kHoWUsvNcg8r8QbxL1eZ2f/Qf43qrFjXIKi5iTdOtO/LF9NNNQYFnA3cT3h9oE
7hfycj8o5yi+KYY3Ca2HjlQ0i7zKYfPul1+Yms5h0nAgcvOXuPltVAlyYrrtddrM
cfpolZZd1IB/lMHSa8/qLviRB5ADlrNx4N6Y1ROeCPCWDbO8flrnDOPTDG8a8sCN
llQ0/vBTmenkGyT7UjG5bx9P/gX1FsMShBtyZMa8t8leIJfruDiwdo87wvSDf5IT
I612xdbLpMfGy6i/LnJHhnw61FkpwBKJZ3UrVVkrmjY8IVN8tVdAjy5s4Fplhgjj
BUbk9Ep03YCqfO6fpqh5DkBxCF0dnj4dZrcHA881/DnZuUkxpMJhyNjJDIx1OLup
PC+y9eILFAnDveFvhZJZeMpH7wAheyrW/WgKZsZNLYZ+61pKPyGn9RrE5UBgI7ra
CSIi9Km/lAuNCd9o4n5/5wCd3a9dW47kCrRe3S20oF57v5RU3AVtbC2YSUqPYahf
NR4ZgL+zDhByLPRVij5FJ1LLeaJJftKM9uUO9egHOk1JnSxDc9EyU41x4838SKYv
CfhQJw1ISTTRNCGeflE7+CBfEYCKX7h+DySVCtxTsg6PelcQHLs=
=C1HZ
-----END PGP SIGNATURE-----
Merge tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- add support for system call stack randomization
- handle stale PCI deconfiguration events
- couple of defconfig updates
- some fixes and cleanups
* tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: fix detection of vector enhancements facility 1 vs. vector packed decimal facility
s390/entry: add support for syscall stack randomization
s390/configs: change CONFIG_VIRTIO_CONSOLE to "m"
s390/cio: remove invalid condition on IO_SCH_UNREG
s390/cpumf: remove call to perf_event_update_userpage
s390/cpumf: move counter set size calculation to common place
s390/cpumf: beautify if-then-else indentation
s390/configs: enable CONFIG_PCI_IOV
s390/pci: handle stale deconfiguration events
s390/pci: rename zpci_configure_device()
Merge more updates from Andrew Morton:
"The remainder of the main mm/ queue.
143 patches.
Subsystems affected by this patch series (all mm): pagecache, hugetlb,
userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
kfence"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits)
kfence: use power-efficient work queue to run delayed work
kfence: maximize allocation wait timeout duration
kfence: await for allocation using wait_event
kfence: zero guard page after out-of-bounds access
mm/process_vm_access.c: remove duplicate include
mm/mempool: minor coding style tweaks
mm/highmem.c: fix coding style issue
btrfs: use memzero_page() instead of open coded kmap pattern
iov_iter: lift memzero_page() to highmem.h
mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
mm/zswap.c: switch from strlcpy to strscpy
arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
mm,memory_hotplug: allocate memmap from the added memory range
mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
mm,memory_hotplug: relax fully spanned sections check
drivers/base/memory: introduce memory_block_{online,offline}
mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
...
Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
This series tries to disable huge pmd unshare of hugetlbfs backed memory
for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage,
the idea of this series may be needed for multiple tasks (Axel's uffd
minor fault series, and Mike's soft dirty series), so I picked it out
from the larger series.
This patch (of 4):
It is a preparation work to be able to behave differently in the per
architecture huge_pte_alloc() according to different VMA attributes.
Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call.
[peterx@redhat.com: build fix]
Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PoP documents:
134: The vector packed decimal facility is installed in the
z/Architecture architectural mode. When bit 134 is
one, bit 129 is also one.
135: The vector enhancements facility 1 is installed in
the z/Architecture architectural mode. When bit 135
is one, bit 129 is also one.
Looks like we confuse the vector enhancements facility 1 ("EXT") with the
Vector packed decimal facility ("BCD"). Let's fix the facility checks.
Detected while working on QEMU/tcg z14 support and only unlocking
the vector enhancements facility 1, but not the vector packed decimal
facility.
Fixes: 2583b848ca ("s390: report new vector facilities")
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20210503121244.25232-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEgycj0O+d1G2aycA8rZhLv9lQBTwFAmCInP4ACgkQrZhLv9lQ
BTza0g//dTeb9woC9H7qlEhK4l9yk62lTss60Q8X7m7ZSNfdL4tiEbi64SgK+iOW
OOegbrOEb8Kzh4KJJYmVlVZ5YUWyH4szgmee1wnylBdsWiWaPLPF3Cflz77apy6T
TiiBsJd7rRE29FKheaMt34B41BMh8QHESN+DzjzJWsFoi/uNxjgSs2W16XuSupKu
bpRmB1pYNXMlrkzz7taL05jndZYE5arVriqlxgAsuLOFOp/ER7zecrjImdCM/4kL
W6ej0R1fz2Geh6CsLBJVE+bKWSQ82q5a4xZEkSYuQHXgZV5eywE5UKu8ssQcRgQA
VmGUY5k73rfY9Ofupf2gCaf/JSJNXKO/8Xjg0zAdklKtmgFjtna5Tyg9I90j7zn+
5swSpKuRpilN8MQH+6GWAnfqQlNoviTOpFeq3LwBtNVVOh08cOg6lko/bmebBC+R
TeQPACKS0Q0gCDPm9RYoU1pMUuYgfOwVfVRZK1prgi2Co7ZBUMOvYbNoKYoPIydr
ENBYljlU1OYwbzgR2nE+24fvhU8xdNOVG1xXYPAEHShu+p7dLIWRLhl8UCtRQpSR
1ofeVaJjgjrp29O+1OIQjB2kwCaRdfv/Gq1mztE/VlMU/r++E62OEzcH0aS+mnrg
yzfyUdI8IFv1q6FGT9yNSifWUWxQPmOKuC8kXsKYfqfJsFwKmHM=
=uCN4
-----END PGP SIGNATURE-----
Merge tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull Landlock LSM from James Morris:
"Add Landlock, a new LSM from Mickaël Salaün.
Briefly, Landlock provides for unprivileged application sandboxing.
From Mickaël's cover letter:
"The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes. Because Landlock
is a stackable LSM [1], it makes possible to create safe security
sandboxes as new security layers in addition to the existing
system-wide access-controls. This kind of sandbox is expected to
help mitigate the security impact of bugs or unexpected/malicious
behaviors in user-space applications. Landlock empowers any
process, including unprivileged ones, to securely restrict
themselves.
Landlock is inspired by seccomp-bpf but instead of filtering
syscalls and their raw arguments, a Landlock rule can restrict the
use of kernel objects like file hierarchies, according to the
kernel semantic. Landlock also takes inspiration from other OS
sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD
Pledge/Unveil.
In this current form, Landlock misses some access-control features.
This enables to minimize this patch series and ease review. This
series still addresses multiple use cases, especially with the
combined use of seccomp-bpf: applications with built-in sandboxing,
init systems, security sandbox tools and security-oriented APIs [2]"
The cover letter and v34 posting is here:
https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/
See also:
https://landlock.io/
This code has had extensive design discussion and review over several
years"
Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1]
Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2]
* tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
landlock: Enable user space to infer supported features
landlock: Add user and kernel documentation
samples/landlock: Add a sandbox manager example
selftests/landlock: Add user space tests
landlock: Add syscall implementations
arch: Wire up Landlock syscalls
fs,security: Add sb_delete hook
landlock: Support filesystem access-control
LSM: Infrastructure management of the superblock
landlock: Add ptrace restrictions
landlock: Set up the security framework and manage credentials
landlock: Add ruleset and domain management
landlock: Add object management
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Merge misc updates from Andrew Morton:
"A few misc subsystems and some of MM.
175 patches.
Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
mm/memory-failure: unnecessary amount of unmapping
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
net: page_pool: use alloc_pages_bulk in refill code path
net: page_pool: refactor dma_map into own function page_pool_dma_map
SUNRPC: refresh rq_pages using a bulk page allocator
SUNRPC: set rq_page_end differently
mm/page_alloc: inline __rmqueue_pcplist
mm/page_alloc: optimize code layout for __alloc_pages_bulk
mm/page_alloc: add an array-based interface to the bulk page allocator
mm/page_alloc: add a bulk page allocator
mm/page_alloc: rename alloced to allocated
mm/page_alloc: duplicate include linux/vmalloc.h
mm, page_alloc: avoid page_to_pfn() in move_freepages()
mm/Kconfig: remove default DISCONTIGMEM_MANUAL
mm: page_alloc: dump migrate-failed pages
mm/mempolicy: fix mpol_misplaced kernel-doc
mm/mempolicy: rewrite alloc_pages_vma documentation
mm/mempolicy: rewrite alloc_pages documentation
mm/mempolicy: rename alloc_pages_current to alloc_pages
...
- Enable KFENCE for 32-bit.
- Implement EBPF for 32-bit.
- Convert 32-bit to do interrupt entry/exit in C.
- Convert 64-bit BookE to do interrupt entry/exit in C.
- Changes to our signal handling code to use user_access_begin/end() more extensively.
- Add support for time namespaces (CONFIG_TIME_NS)
- A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le Goater, Chen Huang, Chris
Packham, Christophe Leroy, Christopher M. Riedl, Colin Ian King, Dan Carpenter, Daniel
Axtens, Daniel Henrique Barboza, David Gibson, Davidlohr Bueso, Denis Efremov,
dingsenjie, Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren Myneni, He Ying,
Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee Jones, Leonardo Bras, Li Huafei,
Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Nathan Chancellor, Nathan
Lynch, Nicholas Piggin, Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi
Bangoria, Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen Rothwell, Thadeu Lima
de Souza Cascardo, Thomas Gleixner, Tony Ambardar, Tyrel Datwyler, Vaibhav Jain,
Vincenzo Frascino, Xiongwei Song, Yang Li, Yu Kuai, Zhang Yunkai.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmCLV1kTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUyD/4jrTolG4sVec211hYO+0VuJzoqN4Cf
j2CA2Ju39butnSMiq4LJUPRB7QRZY1OofkoNFpZeDQspjfZXPz2ulpYAz+SxHWE2
ReHPmWH1rOABlUPXFboePF4OLwmAs9eR5mN2z9HpKXbT3k78HaToLqiONyB4fVCr
Q5TkJeRn/Y7ZJLdyPLTpczHHleQ8KoM6kT7ncXnTm6p97JOBJSrGaJ5N/8X5a4+e
6jtgB7Pvw8jNDShSr8BDLBgBZZcmoTiuG8KfgwRZ+m+mKB1yI2X8S/a54w/lDi9g
UcSv3jQcFLJuW+T/pYe4R330uWDYa0cwjJOtMmsJ98S4EYOevoe9fZuL97qNshme
xtBr4q1i03G1icYOJJ8dXtvabG2rUzj8t1SCDpwYfrynzTWVRikiQYTXUBhRSFoK
nsoklvKd2IZa485XYJ2ljSyClMy8S4yJJ9RuzZ94DTXDSJUesKuyRWGnso4mhkcl
wvl4wwMTJvnCMKVo6dsJyV24QWfd6dABxzm04uPA94CKhG33UwK8252jXVeaohSb
WSO7qWBONgDXQLJ0mXRcEYa9NHvFS4Jnp6APbxnHr1gS+K+PNkD4gPBf34FoyN0E
9s27kvEYk5vr8APUclETF6+FkbGUD5bFbusjt3hYloFpAoHQ/k5pFVDsOZNPA8sW
fDIRp05KunDojw==
=dfKL
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Enable KFENCE for 32-bit.
- Implement EBPF for 32-bit.
- Convert 32-bit to do interrupt entry/exit in C.
- Convert 64-bit BookE to do interrupt entry/exit in C.
- Changes to our signal handling code to use user_access_begin/end()
more extensively.
- Add support for time namespaces (CONFIG_TIME_NS)
- A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
- Other smaller features, fixes & cleanups.
Thanks to Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh
Kumar K.V, Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le
Goater, Chen Huang, Chris Packham, Christophe Leroy, Christopher M.
Riedl, Colin Ian King, Dan Carpenter, Daniel Axtens, Daniel Henrique
Barboza, David Gibson, Davidlohr Bueso, Denis Efremov, dingsenjie,
Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren
Myneni, He Ying, Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee
Jones, Leonardo Bras, Li Huafei, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Nathan Chancellor, Nathan Lynch, Nicholas Piggin,
Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi Bangoria,
Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen
Rothwell, Thadeu Lima de Souza Cascardo, Thomas Gleixner, Tony Ambardar,
Tyrel Datwyler, Vaibhav Jain, Vincenzo Frascino, Xiongwei Song, Yang Li,
Yu Kuai, and Zhang Yunkai.
* tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (302 commits)
powerpc/signal32: Fix erroneous SIGSEGV on RT signal return
powerpc: Avoid clang uninitialized warning in __get_user_size_allowed
powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe
powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n
powerpc/kasan: Fix shadow start address with modules
powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
powerpc/44x: fix spelling mistake in Kconfig "varients" -> "variants"
powerpc/iommu: Annotate nested lock for lockdep
powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
powerpc/iommu: Allocate it_map by vmalloc
selftests/powerpc: remove unneeded semicolon
powerpc/64s: remove unneeded semicolon
powerpc/eeh: remove unneeded semicolon
powerpc/selftests: Add selftest to test concurrent perf/ptrace events
powerpc/selftests/perf-hwbreak: Add testcases for 2nd DAWR
powerpc/selftests/perf-hwbreak: Coalesce event creation code
powerpc/selftests/ptrace-hwbreak: Add testcases for 2nd DAWR
powerpc/configs: Add IBMVNIC to some 64-bit configs
selftests/powerpc: Add uaccess flush test
...
mem_init_print_info() is called in mem_init() on each architecture, and
pass NULL argument, so using void argument and move it into mm_init().
Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc]
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64]
Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for adding a random offset to the stack while handling
syscalls. The patch uses get_tod_clock_fast() as this is considered good
enough and has much less performance penalty compared to using
get_random_int(). The patch also adds randomization in pgm_check_handler()
as the sigreturn/rt_sigreturn system calls might be called from there.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210429091451.1062594-1-svens@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In former times, the virtio-console code had to be compiled into
the kernel since the old guest virtio transport had some hard de-
pendencies. But since the old virtio transport has been removed in
commit 7fb2b2d512 ("s390/virtio: remove the old KVM virtio transport"),
we do not have this limitation anymore.
Commit bb533ec8ba ("s390/config: do not select VIRTIO_CONSOLE via
Kconfig") then also lifted the hard setting in the Kconfig system, so
we can finally switch the CONFIG_VIRTIO_CONSOLE knob to compile this
driver as a module now, making it more flexible for the user to only
load it if it is really required.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20210428082442.321327-1-thuth@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The function cpumf_pmu_add and cpumf_pmu_del call function
perf_event_update_userpage(). This calls is obsolete, the calls add and
delete a counter event. Counter events do not sample data and the
event->rb member to access the sampling ring buffer is always NULL.
The function perf_event_update_userpage() simply returns in this case.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The function to calculate the size of counter sets is renamed from
cf_diag_ctrset_size() to cpum_cf_ctrset_size() and moved to the file
containing common functions for the CPU Measurement Counter Facility.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Beautify if-then-else indentation to match coding guideline.
Also use shorter pointer notation hwc instead of event->hw.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
All major distributions ship with CONFIG_PCI_IOV=y so let us enable it
for our defconfigs as well.
Note also that since commit e5794cf1a2 ("s390/pci: create links
between PFs and VFs") we enabled proper linking between PFs and their
associated VFs so with this commit and its fixes applied we can fully
support handling SR-IOV enabled PFs.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The PCIs event with PEC 0x0303 or 0x0304 are a request to deconfigure
a PCI function, respectively an indication that it was already
deconfigured by the platform. If such an event is queued during boot it
may happen that the platform has already adjusted the configuration flag
of the relevant function in the CLP List PCI Functions result. In this
case we might not have configured the PCI function at all and should
thus ignore the event. Note that no locking is necessary as event
handling only starts after we have fully initialized the zPCI subsystem
and scanned all PCI devices listed in the CLP result.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
With zpci_configure_device() now always called on a device that has
already been configured on the platform level its name has become
misleading. Rename it to zpci_scan_configured_device() to signify that
the function now only handles the correct scanning of a newly configured
PCI function taking care of the special handling necessary for function
0 and functions parked waiting for a PCI bus that can't be created
without first seeing function 0.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to
walk all map elements in a more robust and easier to verify
fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF
on s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices
which don't need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability
on next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is
slow in reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take
place correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP
packets before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used
to define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
-independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and
bnxt support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP
which define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay
for packets sampled from switch HW (and corresponding egress
and policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP
forwarding, bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface
to distribute MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x -
11-port Ethernet switch with 8x 1-Gigabit Ethernet
and 3x 10-Gigabit interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
and BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack,
matching on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode
changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
=vcbA
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to walk
all map elements in a more robust and easier to verify fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF on
s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices which don't
need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability on
next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is slow in
reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take place
correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP packets
before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used to
define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP which
define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay for
packets sampled from switch HW (and corresponding egress and
policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP forwarding,
bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface to distribute
MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack, matching
on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes"
* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
net: selftest: fix build issue if INET is disabled
net: netrom: nr_in: Remove redundant assignment to ns
net: tun: Remove redundant assignment to ret
net: phy: marvell: add downshift support for M88E1240
net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
net/sched: act_ct: Remove redundant ct get and check
icmp: standardize naming of RFC 8335 PROBE constants
bpf, selftests: Update array map tests for per-cpu batched ops
bpf: Add batched ops support for percpu array
bpf: Implement formatted output helpers with bstr_printf
seq_file: Add a seq_bprintf function
sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
net: fix a concurrency bug in l2tp_tunnel_register()
net/smc: Remove redundant assignment to rc
mpls: Remove redundant assignment to err
llc2: Remove redundant assignment to rc
net/tls: Remove redundant initialization of record
rds: Remove redundant assignment to nr_sig
dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJU1UACgkQnJ2qBz9k
QNk62AgAgp05OIXU/AgObb7DvSyI3ycwCV8PeWBpwD8yoDAh5x0tmT7vnJu974p6
yHdnF7rr69ZzvbNCHLJ5kRykRlUao9W7cO5fdOW1uTpL7Ic60QuJMks/NfgVTHp1
2zIQmBDerfn1/LTK8r2pPGcvtcjRcr7Ep4beN0Duw57lfVMJhjsNRPnBbXGBcp0r
QzKk4/8V3DCZvOw+XNC3nto7avjvf+nU9sJmuh83546eqh0atjWivvO5aAlDOe6W
rhBiLlmP0in5u2n1fYqzI1OQvtgtleyEZT2G0CrbAZn0xjmV/if9wl+3K6TOwDvR
778xDEX7sZCaO/xkB+WK3hrd15ftKg==
=0kYE
-----END PGP SIGNATURE-----
Merge tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, ext2, reiserfs updates from Jan Kara:
- support for path (instead of device) based quotactl syscall
(quotactl_path(2))
- ext2 conversion to kmap_local()
- other minor cleanups & fixes
* tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs/reiserfs/journal.c: delete useless variables
fs/ext2: Replace kmap() with kmap_local_page()
ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry()
fs/ext2/: fix misspellings using codespell tool
quota: report warning limits for realtime space quotas
quota: wire up quotactl_path
quota: Add mountpath based quota support
- fix buffer size for in-kernel disassembler for ebpf programs.
- fix two memory leaks in zcrypt driver.
- expose PCI device UID as index, including an indicator if the uid is
unique.
- remove some oprofile leftovers.
- improve stack unwinder tests.
- don't use gcc atomic builtins anymore, just like all other
architectures. Even though I'm sure the current code is ok, I
totally dislike that s390 is the only architecture being special
here; especially considering that there was a lengthly discussion
about this topic and the outcome was not to use the builtins.
Therefore open-code atomic ops again with inline assembly and switch
to gcc builtins as soon as other architectures are doing.
- couple of other changes to atomic and cmpxchg, and use
atomic-instrumented.h for KASAN.
- separate zbus creation, registration, and scanning in our PCI code
which allows for cleaner and easier handling.
- a rather large change to the vfio-ap code to fix circular locking
dependencies when updating crypto masks.
- move QAOB handling from qdio layer down to drivers.
- add CRW inject facility to common I/O layer. This adds debugs files
which allow to generate artificial events from user space for
testing purposes.
- increase SCLP console line length from 80 to 320 characters to avoid
odd wrapped lines.
- add protected virtualization guest and host indication files, which
indicate either that a guest is running in pv mode or if the
hypervisor is capable of starting pv guests.
- various other small fixes and improvements all over the place.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmCICNwACgkQIg7DeRsp
bsJgSQ/0Cn3KTymF8SJ7tXLpYBNHmZBL1sQ284pPNYmqluwephUaX643f/48RR5y
rlbaewyHCbYyR+gwkUykuUQm/d+iwihip/uFmyEktr9JutOOS1RQKd8ujeyE2BUb
aaDyE0J5VFYdd/ZA92n9FhkuNDOMZRFAK1SOfifT9jNWCl8iYz+pXV1Gx7LbJYVY
KWfJ9D/zgzLoOTWhj4jWu8LutfLEqK+hq5nqxBII8APCV/QDYnjkwpwW01LoMtOv
eHhtSz0JboFRk0FYf8oyR7AXQBz76+Ru3aivJNL7sr1S3N2yMSzNQbk/ATVBLER9
VMQX2TfGGT/Ln3P4rYEoP2vGITRn765wg4KWNB2u3pY2try12G39fmjzOwVfbxQw
BDAcLwxU7Tw0vJY+yI6ZWkPDXcs+uAWQwNiYoMtfUPfMYLEpLFffbWGwdZPKZRrH
fy4e5ZFuavZsZr8Zeu4WYILJZoDnvhbs59gPzaLBPKosR0ZGNi8q6bnztnqnrYhi
Oirt6aPOOyEoN/IT2bO1sDhIzIpCorIwCMZTRIQzerFRcjJgy0xHM9MRYRLMj6iW
xltgWNt01SbTm6pbimwMUnjN5AdMTbHlTSzD8G34eWO21cVgHGpndbcT/M3uemyy
Wf034Er2eFqlhXyhiAYTnNhVGbd+YMido7eo/CbGvqCxNvmKbA==
=e+mh
-----END PGP SIGNATURE-----
Merge tag 's390-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- fix buffer size for in-kernel disassembler for ebpf programs.
- fix two memory leaks in zcrypt driver.
- expose PCI device UID as index, including an indicator if the uid is
unique.
- remove some oprofile leftovers.
- improve stack unwinder tests.
- don't use gcc atomic builtins anymore, just like all other
architectures. Even though I'm sure the current code is ok, I totally
dislike that s390 is the only architecture being special here;
especially considering that there was a lengthly discussion about
this topic and the outcome was not to use the builtins. Therefore
open-code atomic ops again with inline assembly and switch to gcc
builtins as soon as other architectures are doing.
- couple of other changes to atomic and cmpxchg, and use
atomic-instrumented.h for KASAN.
- separate zbus creation, registration, and scanning in our PCI code
which allows for cleaner and easier handling.
- a rather large change to the vfio-ap code to fix circular locking
dependencies when updating crypto masks.
- move QAOB handling from qdio layer down to drivers.
- add CRW inject facility to common I/O layer. This adds debugs files
which allow to generate artificial events from user space for testing
purposes.
- increase SCLP console line length from 80 to 320 characters to avoid
odd wrapped lines.
- add protected virtualization guest and host indication files, which
indicate either that a guest is running in pv mode or if the
hypervisor is capable of starting pv guests.
- various other small fixes and improvements all over the place.
* tag 's390-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (53 commits)
s390/disassembler: increase ebpf disasm buffer size
s390/archrandom: add parameter check for s390_arch_random_generate
s390/zcrypt: fix zcard and zqueue hot-unplug memleak
s390/pci: expose a PCI device's UID as its index
s390/atomic,cmpxchg: always inline __xchg/__cmpxchg
s390/smp: fix do_restart() prototype
s390: get rid of oprofile leftovers
s390/atomic,cmpxchg: make constraints work with old compilers
s390/test_unwind: print test suite start/end info
s390/cmpxchg: use unsigned long values instead of void pointers
s390/test_unwind: add WARN if tests failed
s390/test_unwind: unify error handling paths
s390: update defconfigs
s390/spinlock: use R constraint in inline assembly
s390/atomic,cmpxchg: switch to use atomic-instrumented.h
s390/cmpxchg: get rid of gcc atomic builtins
s390/atomic: get rid of gcc atomic builtins
s390/atomic: use proper constraints
s390/atomic: move remaining inline assemblies to atomic_ops.h
s390/bitops: make bitops only work on longs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
+HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
/uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
=bO1i
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Borislav Petkov:
"Trivial cleanups and fixes all over the place"
* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove me from IDE/ATAPI section
x86/pat: Do not compile stubbed functions when X86_PAT is off
x86/asm: Ensure asm/proto.h can be included stand-alone
x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
x86/msr: Make locally used functions static
x86/cacheinfo: Remove unneeded dead-store initialization
x86/process/64: Move cpu_current_top_of_stack out of TSS
tools/turbostat: Unmark non-kernel-doc comment
x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
x86/fpu/math-emu: Fix function cast warning
x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
x86: Fix various typos in comments, take #2
x86: Remove unusual Unicode characters from comments
x86/kaslr: Return boolean values from a function returning bool
x86: Fix various typos in comments
x86/setup: Remove unused RESERVE_BRK_ARRAY()
stacktrace: Move documentation for arch_stack_walk_reliable() to header
x86: Remove duplicate TSC DEADLINE MSR definitions
New features:
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
- Alexandru is now a reviewer (not really a new feature...)
Fixes:
- Proper emulation of the GICR_TYPER register
- Handle the complete set of relocation in the nVHE EL2 object
- Get rid of the oprofile dependency in the PMU code (and of the
oprofile body parts at the same time)
- Debug and SPE fixes
- Fix vcpu reset
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCCpuAPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD2G8QALWQYeBggKnNmAJfuihzZ2WariBmgcENs2R2
qNZ/Py6dIF+b69P68nmgrEV1x2Kp35cPJbBwXnnrS4FCB5tk0b8YMaj00QbiRIYV
UXbPxQTmYO1KbevpoEcw8NmR4bZJ/hRYPuzcQG7CCMKIZw0zj2cMcBofzQpTOAp/
CgItdcv7at3iwamQatfU9vUmC0nDdnjdIwSxTAJOYMVV1ENwtnYSNgZVo4XLTg7n
xR/5Qx27PKBJw7GyTRAIIxKAzNXG2tDL+GVIHe4AnRp3z3La8sr6PJf7nz9MCmco
ISgeY7EGQINzmm4LahpnV+2xwwxOWo8QotxRFGNuRTOBazfARyAbp97yJ6eXJUpa
j0qlg3xK9neyIIn9BQKkKx4sY9V45yqkuVDsK6odmqPq3EE01IMTRh1N/XQi+sTF
iGrlM3ZW4AjlT5zgtT9US/FRXeDKoYuqVCObJeXZdm3sJSwEqTAs0JScnc0YTsh7
m30CODnomfR2y5X6GoaubbQ0wcZ2I20K1qtIm+2F6yzD5P1/3Yi8HbXMxsSWyYWZ
1ldoSa+ZUQlzV9Ot0S3iJ4PkphLKmmO96VlxE2+B5gQG50PZkLzsr8bVyYOuJC8p
T83xT9xd07cy+FcGgF9veZL99Y6BLHMa6ZwFUolYNbzJxqrmqyR1aiJMEBIcX+aP
ACeKW1w5
=fpey
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.13
New features:
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
- Alexandru is now a reviewer (not really a new feature...)
Fixes:
- Proper emulation of the GICR_TYPER register
- Handle the complete set of relocation in the nVHE EL2 object
- Get rid of the oprofile dependency in the PMU code (and of the
oprofile body parts at the same time)
- Debug and SPE fixes
- Fix vcpu reset
A review of the code showed, that this function which is exposed
within the whole kernel should do a parameter check for the
amount of bytes requested. If this requested bytes is too high
an unsigned int overflow could happen causing this function to
try to memcpy a really big memory chunk.
This is not a security issue as there are only two invocations
of this function from arch/s390/include/asm/archrandom.h and both
are not exposed to userland.
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
On s390 each PCI device has a user-defined ID (UID) exposed under
/sys/bus/pci/devices/<dev>/uid. This ID was designed to serve as the PCI
device's primary index and to match the device within Linux to the
device configured in the hypervisor. To serve as a primary identifier
the UID must be unique within the Linux instance, this is guaranteed by
the platform if and only if the UID Uniqueness Checking flag is set
within the CLP List PCI Functions response.
In this sense the UID serves an analogous function as the SMBIOS
instance number or ACPI index exposed as the "index" respectively
"acpi_index" device attributes and used by e.g. systemd to set interface
names. As s390 does not use and will likely never use ACPI nor SMBIOS
there is no conflict and we can just expose the UID under the "index"
attribute whenever UID Uniqueness Checking is active and get systemd's
interface naming support for free.
Link: https://lore.kernel.org/lkml/20210412135905.1434249-1-schnelle@linux.ibm.com/
Acked-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Narendra K <narendra_k@dell.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make sure to always inline __xchg() and __cmpxchg() otherwise the
compiler might decide to generate out-of-line versions which will
fail at link time:
s390-linux-ld: lib/atomic64_test.o: in function `__xchg':
>> atomic64_test.c:(.text.unlikely+0xa4): undefined reference to `__xchg_called_with_bad_pointer'
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/linux-mm/202104170449.SIIFKVjT-lkp@intel.com/
Fixes: d2b1f6d2d3 ("s390/cmpxchg: get rid of gcc atomic builtins")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Funciton do_restart() is a callback invoked from the
restart CPU routine and passed a single parameter.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define KVM_GUESTDBG_VALID_MASK and use it to implement this capabiity.
Compile tested only.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a potential race for preemptible kernels, where
the host kernel would get a fault when it is preempted as
the wrong point in time.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJgeo0hAAoJEBF7vIC1phx8Um4QAID4KCVuRhAiRs3z2m4DYHsQ
cKTUGuBkty7gJfrO5byT9brvN9nf58Sxm22U/fUzgSD+W4wBQMVUl2nJFECg7ZoH
GITCOl9UCT35Sllp6v2ZJB/RtVGESklhmS8rJo7FAXjR2SlJJaW0nZvFuI//jcjX
5O+DSj2PoqJPSmwasZWCyCvHJouswcEFkF+1wI3oUww7XMBFF31MPI1g8jZ4DRtj
BI8uDx5W41qnpbccMQNHmi15J8ff+Of3qWe8y2+z+68puNHdNYV/fwybfa0OhelV
bgkdNA1HOeUVcKkf+JpDsl/1LmIfrWbwieDlGuUapjJU4ohMXwS8/m5lePq7Gmnn
Zf03aSk+GfD4T4l5HJcFEqy0HxHWrGYgGVMWKlvXm9qkdQ/1tl5DhWHgHKbg8L6f
btEpKrwAuzTE/5zDd163pB/E4oVXXqvSn8pfCEsx5T7azxDiGllxCAP+oU7tSwlS
wjgwJYwJvKTvsgVSR8FeCWUgcCDD3Y6yI5KZZcpzPuwcfNQsl50Z1GYFmS/WTl9J
cqmAFsanNR/PC1SmVnuJgucOPx3vyVqcHQ8AWK2TirHuRx5q53oBqFBioB3dY96G
8/SkXOskwvlsI2lzrNGaSm9Sd63Su82pU9NlU7crHzhQScoHNNIYI1dd3zW9k9Nr
Y8KTpV79FyZdyomnoRH+
=CsvI
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Fix potential crash in preemptible kernels
There is a potential race for preemptible kernels, where
the host kernel would get a fault when it is preempted as
the wrong point in time.
Old gcc versions may fail with an internal compiler error if only the
T or S constraint is specified for an operand, and no displacement is
needed at all.
To fix this use RT and QS as constraints, which reflects the union of
both. Later gcc versions do the right thing and always accept single T
and S constraints.
See gcc commit 3e4be43f69da ("S/390: Memory constraint cleanup").
Fixes: ca897bb181 ("s390/atomic: use proper constraints")
Fixes: b23eb636d7 ("s390/atomic: get rid of gcc atomic builtins")
Fixes: d2b1f6d2d3 ("s390/cmpxchg: get rid of gcc atomic builtins")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add couple of additional info lines to make it easier to match
test suite output and results.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
gcc and clang warn about incompatible pointer types due to the recent
cmpxchg changes:
drivers/gpu/drm/drm_lock.c:75:10: error: passing 'typeof (lock)' (aka 'volatile unsigned int *') to parameter of type 'void *' discards qualifiers [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
prev = cmpxchg(lock, old, new);
^~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/atomic-instrumented.h:1685:2: note: expanded from macro 'cmpxchg'
arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid this simply cast pointers to unsigned long and use them
instead of void pointers. This allows to stay with functions, instead
of using complex defines and having to deal with all their potential
side effects.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: d2b1f6d2d3 ("s390/cmpxchg: get rid of gcc atomic builtins")
Link: https://lore.kernel.org/linux-s390/202104130131.sMmSqpb5-lkp@intel.com/
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
store_regs_fmt2() has an ordering problem: first the guarded storage
facility is enabled on the local cpu, then preemption disabled, and
then the STGSC (store guarded storage controls) instruction is
executed.
If the process gets scheduled away between enabling the guarded
storage facility and before preemption is disabled, this might lead to
a special operation exception and therefore kernel crash as soon as
the process is scheduled back and the STGSC instruction is executed.
Fixes: 4e0b1ab72b ("KVM: s390: gs support for kvm guests")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Cc: <stable@vger.kernel.org> # 4.12
Link: https://lore.kernel.org/r/20210415080127.1061275-1-hca@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
For the same reason as commit e876f0b69d ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.
As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/539c4204b1baa77c55f758904a1ea239abbc7a5c.1617209142.git.christophe.leroy@csgroup.eu
Trigger a warning if any of unwinder tests fail. This should help to
prevent quiet ignoring of test results when panic_on_warn is enabled.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Handle the case of "unwind state reliable but addr is 0" like other error
cases in this function and trigger output of failing stacktrace to aid
debugging.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add arch_ prefix to all atomic operations, and define ARCH_ATOMIC.
This enables KASAN instrumentation for all atomic operations on s390.
This is the s390 variant of commit 8bf705d130 ("locking/atomic/x86:
Switch atomic.h to use atomic-instrumented.h").
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390 is the only architecture in the kernel which makes use of gcc's
atomic builtin functions. Even though I don't see any technical
problem with that right now, remove this code and open-code
compare-and-swap loops again, like every other architecture is doing
it also.
We can switch to a generic implementation when other architectures are
doing that also.
See also https://lwn.net/Articles/586838/ for forther details.
This basically reverts commit f318a1229b ("s390/cmpxchg: use
compiler builtins").
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390 is the only architecture in the kernel which makes use of gcc's
atomic builtin functions. Even though I don't see any technical
problem with that right now, remove this code and open-code
compare-and-swap loops again, like every other architecture is doing
it also.
We can switch to a generic implementation when other architectures are
doing that also.
See also https://lwn.net/Articles/586838/ for forther details.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use the R,T, and S constraints instead of the Q constraint in atomic
inline assemblies wherever possible. This allows the compiler to
generate better code. (~ -2kb code size).
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move all remaining inline assemblies from atomic.h to
atomic_ops.h. That way all atomic inline assemblies are
contained within only a single header file.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The bitops code was optimized to generate test under mask instructions
with the __bitops_byte() helper. However that was many years ago and
in the meantime a lot of new instructions were introduced.
Changing the code so that it always operates on longs nowadays even
generates shorter code (~ -20kb, defconfig, gcc 10, march=zE12).
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add conditional trap handlers similar to conditional system calls
(COND_SYSCALL), to reduce the number of ifdefs.
Trap handlers which may or may not exist depending on config options
are supposed to have a COND_TRAP entry, which redirects to
default_trap_handler() for non-existent trap handlers during link
time.
This allows to get rid of the secure execution trap handlers for the
!PGSTE case.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently zpci_configure_device() can be called on a zPCI function in
two completely different states. Either the underlying zPCI function has
already been configured by the platform and we are only doing the
scanning to get it usable by Linux drivers. Or the underlying function
is in Standby and we first do an SCLP to get it configured. This makes
zpci_configure_device() harder to reason about. Since calling
zpci_configure_device() on a function in Standby only happens in
enable_slot() simply pull out the SCLP call and setting of zdev->state
and thus call zpci_configure_device() under the same circumstances as
in the event handling code.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Now that the zbus can be created without being scanned we can go one
step further and make registering a device to a zbus independent from
scanning it. This way the zbus handling becomes much more natural
in that functions can be registered on the zbus to be scanned later more
closely resembling the handling of both real PCI hardware and other
virtual PCI busses like Hyper-V's virtual PCI bus (see for example
drivers/pci/controller/pci-hyperv.c:create_root_hv_pci_bus()).
Having zbus registration separate from scanning allows us to return
fully initialized but still disabled zdevs from zpci_create_device()
which can then be configured just as we would configure a zdev from
standby (minus the SCLP Configure already done by the platform). There
is still the exception that a PCI function with non-zero devfn can be
plugged before its PCI bus, which depends on the function with zero
devfn, is created. In this case the zdev returend from
zpci_create_device() is still missing its bus, hotplug slot, and
resources which need to be created later but at least it doesn't wait in
the enabled state and can otherwise be treated as initialized.
With this we also separate the initial PCI scan using CLP List PCI
Functions into two phases. In the CLP loop's callback we only register
each function with a virtual zbus creating the latter as needed. Then,
after we have built this virtual PCI topology based on our list of
zbusses, we can make use of the common code functionality to scan each
complete zbus as a separate child bus.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In a later change we will first collect all PCI functions from the CLP
List PCI functions call, then register them to/creating the relevant
zbus. Then only after we've created our virtual bus structure will we
scan all zbusses iterating over the zbus list. Since scanning is
relatively slow a spinlock is a bad fit for protecting the
loop over the devices on the zbus. Furthermore doing the probing on the
bus we need to use pci_lock_rescan_remove() as devices are added to
the PCI subsystem and that is a mutex which can't be locked nested
inside a spinlock section. Note that the contention of this lock should
be very low either way as zbusses are only added/removed concurrently on
hotplug events.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the existing code the creation of the PCI bus and the scanning of
function zero all happens in zpci_scan_bus(). This in turn requires
functions to be enabled and their resources to be available before the
PCI bus is even created.
This not only means that functions are enabled long before they are
actually made available to the common PCI subsystem. In case of
functions with non-zero devfn which appeared before the function with
devfn zero they can wait arbitrarily long in this enabled but not
scanned state.
Fix this by separating the creation of the PCI bus from scanning it and
only prepare, that is enable and setup MMIO bus resources, functions
just before they are scanned. As they may be scanned multiple times
track if we already created resources in the zdev.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Pull setting the maximum bus speed and multifunction attribute into
zpci_bus_scan() in preparation for handling bus creation separately
from scanning the bus.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
To match zpci_bus_scan_device() and the PCI common code terminology and
to remove some code duplication, we pull the multiple uses of
pci_scan_single_device() into a function. For now this has the side
effect of adding each device to the PCI bus separately and locking and
unlocking the rescan/remove lock for each instead of just once per bus.
This is clearly less efficient but provides a correct intermediate
behavior until a follow on change does both the adding and scanning only
once per bus.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Convert the program check table to C. Which allows to get rid of yet
another assembler file, and also enables proper type checking for the
table.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Fixes: 37564ed834 ("s390/uv: add prot virt guest/host indication files")
Link: https://lore.kernel.org/r/2f7d62a4-3e75-b2b4-951b-75ef8ef59d16@huawei.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
* fixes:
s390/entry: save the caller of psw_idle
s390/entry: avoid setting up backchain in ext|io handlers
s390/setup: use memblock_free_late() to free old stack
s390/irq: fix reading of ext_params2 field from lowcore
s390/unwind: add machine check handler stack
s390/cpcmd: fix inline assembly register clobbering
MAINTAINERS: add backups for s390 vfio drivers
s390/vdso: fix initializing and updating of vdso_data
s390/vdso: fix tod_steering_delta type
s390/vdso: copy tod_steering_delta value to vdso_data page
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently psw_idle does not allocate a stack frame and does not
save its r14 and r15 into the save area. Even though this is valid from
call ABI point of view, because psw_idle does not make any calls
explicitly, in reality psw_idle is an entry point for controlled
transition into serving interrupts. So, in practice, psw_idle stack
frame is analyzed during stack unwinding. Depending on build options
that r14 slot in the save area of psw_idle might either contain a value
saved by previous sibling call or complete garbage.
[task 0000038000003c28] do_ext_irq+0xd6/0x160
[task 0000038000003c78] ext_int_handler+0xba/0xe8
[task *0000038000003dd8] psw_idle_exit+0x0/0x8 <-- pt_regs
([task 0000038000003dd8] 0x0)
[task 0000038000003e10] default_idle_call+0x42/0x148
[task 0000038000003e30] do_idle+0xce/0x160
[task 0000038000003e70] cpu_startup_entry+0x36/0x40
[task 0000038000003ea0] arch_call_rest_init+0x76/0x80
So, to make a stacktrace nicer and actually point for the real caller of
psw_idle in this frequently occurring case, make psw_idle save its r14.
[task 0000038000003c28] do_ext_irq+0xd6/0x160
[task 0000038000003c78] ext_int_handler+0xba/0xe8
[task *0000038000003dd8] psw_idle_exit+0x0/0x6 <-- pt_regs
([task 0000038000003dd8] arch_cpu_idle+0x3c/0xd0)
[task 0000038000003e10] default_idle_call+0x42/0x148
[task 0000038000003e30] do_idle+0xce/0x160
[task 0000038000003e70] cpu_startup_entry+0x36/0x40
[task 0000038000003ea0] arch_call_rest_init+0x76/0x80
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently when interrupt arrives to cpu while in kernel context
INT_HANDLER macro (used for ext_int_handler and io_int_handler)
allocates new stack frame and pt_regs on the kernel stack and
sets up the backchain to jump over the pt_regs to the frame which has
been interrupted. This is not ideal to two reasons:
1. This hides the fact that kernel stack contains interrupt frame in it
and hence breaks arch_stack_walk_reliable(), which needs to know that to
guarantee "reliability" and checks that there are no pt_regs on the way.
2. It breaks the backchain unwinder logic, which assumes that the next
stack frame after an interrupt frame is reliable, while it is not.
In some cases (when r14 contains garbage) this leads to early unwinding
termination with an error, instead of marking frame as unreliable
and continuing.
To address that, only set backchain to 0.
Fixes: 56e62a7370 ("s390: convert to generic entry")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use memblock_free_late() to free the old machine check stack to the
buddy allocator instead of leaking it.
Fixes: b61b159512 ("s390: add stack for machine check handler")
Cc: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Due to historical reasons mark_kernel_pXd() functions
misuse the notion of physical vs virtual addresses
difference.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
struct ccw1 is declared twice. One has been declared
at 21st line. Remove the duplicate.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Acked-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
On s390 each PCI device has a user-defined ID (UID) exposed under
/sys/bus/pci/devices/<dev>/uid. This ID was designed to serve as the PCI
device's primary index and to match the device within Linux to the
device configured in the hypervisor. To serve as a primary identifier
the UID must be unique within the Linux instance, this is guaranteed by
the platform if and only if the UID Uniqueness Checking flag is set
within the CLP List PCI Functions response.
While the UID has been exposed to userspace since commit ac4995b9d5
("s390/pci: add some new arch specific pci attributes") whether or not
the platform guarantees its uniqueness for the lifetime of the Linux
instance while defined is not visible from userspace. Remedy this by
exposing this as a per device attribute at
/sys/bus/pci/devices/<dev>/uid_is_unique
Keeping this a per device attribute allows for maximum flexibility if we
ever end up with some devices not having a UID or not enjoying the
guaranteed uniqueness.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In commit dee60c0dbc ("s390/pci: add zpci_event_hard_deconfigured()")
we added a zdev_enabled() check to what was previously an uncoditional
call to zpci_disable_device(). There are two problems with that. Firstly
zpci_had_deconfigured() is only called on event 0x0304 for which the
device is always already disabled by the platform so it is always false.
Secondly zpci_disable_device() not only disables the device but also
calls zpci_dma_exit_device() which is thus not called and we leak the
DMA tables.
Fix this by calling zpci_disable_device() unconditionally to perform
Linux side cleanup including the freeing of DMA tables.
Fixes: dee60c0dbc ("s390/pci: add zpci_event_hard_deconfigured()")
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-03-24
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 15 day(s) which contain
a total of 65 files changed, 3200 insertions(+), 738 deletions(-).
The main changes are:
1) Static linking of multiple BPF ELF files, from Andrii.
2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo.
3) Spelling fixes from various folks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Li Wang reported that clock_gettime(CLOCK_MONOTONIC_RAW, ...) returns
incorrect values when time is provided via vdso instead of system call:
vdso_ts_nsec = 4484351380985507, vdso_ts.tv_sec = 4484351, vdso_ts.tv_nsec = 380985507
sys_ts_nsec = 1446923235377, sys_ts.tv_sec = 1446, sys_ts.tv_nsec = 923235377
Within the s390 specific vdso function __arch_get_hw_counter() reads
tod clock steering values from the arch_data member of the passed in
vdso_data structure.
Problem is that only for the CS_HRES_COARSE vdso_data arch_data is
initialized and gets updated. The CS_RAW specific vdso_data does not
contain any valid tod_clock_steering information, which explains the
different values.
Fix this by initializing and updating all vdso_datas.
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Fixes: 1ba2d6c0fd ("s390/vdso: simplify __arch_get_hw_counter()")
Link: https://lore.kernel.org/linux-s390/YFnxr1ZlMIOIqjfq@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The s390 specific vdso function __arch_get_hw_counter() is supposed to
consider tod clock steering.
If a tod clock steering event happens and the tod clock is set to a
new value __arch_get_hw_counter() will not return the real tod clock
value but slowly drift it from the old delta until the returned value
finally matches the real tod clock value again.
Unfortunately the type of tod_steering_delta unsigned while it is
supposed to be signed. It depends on if tod_steering_delta is negative
or positive in which direction the vdso code drifts the clock value.
Worst case is now that instead of drifting the clock slowly it will
jump into the opposite direction by a factor of two.
Fix this by simply making tod_steering_delta signed.
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Cc: <stable@vger.kernel.org> # 5.10
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When converting the vdso assembler code to C it was forgotten to
actually copy the tod_steering_delta value to vdso_data page.
Which in turn means that tod clock steering will not work correctly.
Fix this by simply copying the value whenever it is updated.
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Cc: <stable@vger.kernel.org> # 5.10
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
prot_virt_host is only available if CONFIG_KVM is enabled. So lets use
a variable initialized to zero and overwrite it when that config
option is set with prot_virt_host.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Fixes: 37564ed834 ("s390/uv: add prot virt guest/host indication files")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Prefixing needs to be applied to the guest real address to translate it
into a guest absolute address.
The value of MSO needs to be added to a guest-absolute address in order to
obtain the host-virtual.
Fixes: bdf7509bbe ("s390/kvm: VSIE: correctly handle MVPG when in VSIE")
Reported-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210322140559.500716-3-imbrenda@linux.ibm.com
[borntraeger@de.ibm.com simplify mso]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
A new function _kvm_s390_real_to_abs will apply prefixing to a real address
with a given prefix value.
The old kvm_s390_real_to_abs becomes now a wrapper around the new function.
This is needed to avoid code duplication in vSIE.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210322140559.500716-2-imbrenda@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Extend kvm_s390_shadow_fault to return the pointer to the valid leaf
DAT table entry, or to the invalid entry.
Also return some flags in the lower bits of the address:
PEI_DAT_PROT: indicates that DAT protection applies because of the
protection bit in the segment (or, if EDAT, region) tables.
PEI_NOT_PTE: indicates that the address of the DAT table entry returned
does not refer to a PTE, but to a segment or region table.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: stable@vger.kernel.org
Reviewed-by: Janosch Frank <frankja@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210302174443.514363-3-imbrenda@linux.ibm.com
[borntraeger@de.ibm.com: fold in a fix from Claudio]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We are spending way too much effort on qdio-internal bookkeeping for
QAOB management & caching, and it's still not robust. Once qdio's
TX path has detached the QAOB from a PENDING buffer, we lost all
track of it until it shows up in a CQ notification again. So if the
device is torn down before that notification arrives, we leak the QAOB.
Just have the driver take care of it, and simply pass down a QAOB if
they want a TX with async-completion capability. For a buffer in PENDING
state that requires the QAOB for final completion, qeth can now also try
to recycle the buffer's QAOB rather than unconditionally freeing it.
This also eliminates the qdio_outbuf_state array, which was only needed
to transfer the aob->user1 tag from the driver to the qdio layer.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The zpci_remove_device() function removes the device from the PCI common
code core which is an operation dealing primarily with the zbus and PCI
bus code. With that and to match an upcoming refactoring of the
symmetric scanning part move it to the bus code.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
A zPCI event with PEC 0x0301 for an existing zPCI device goes through
the same actions as enable_slot(). Similarly a zPCI event with PEC
0x0303 does the same steps as disable_slot().
We can thus unify both actions as zpci_configure_device() respectively
zpci_deconfigure_device().
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This patch introduces the mechanism to inject artificial events to the
CIO layer.
One of the main-event type which triggers the CommonIO operations are
Channel Report events. When a malfunction or other condition affecting
channel-subsystem operation is recognized, a Channel Report Word
(consisting of one or more CRWs) describing the condition is made
pending for retrieval and analysis by the program. The CRW contains
information concerning the identity and state of a facility following
the detection of the malfunction or other condition.
The patch introduces two debugfs interfaces which can be used to inject
'artificial' events from the userspace. It is intended to provide an easy
means to increase the test coverage for CIO code. And this functionality
can be enabled via a new configuration option CONFIG_CIO_INJECT.
The newly introduces debugfs interfaces can be used as mentioned below
to generate different fake-events. To use the crw_inject, first we should
enable it by using enable_inject interface.
i.e
echo 1 > /sys/kernel/debug/s390/cio/enable_inject
After the first step, user can simulate CRW as follows:
echo <solicited> <overflow> <chaining> <rsc> <ancillary> <erc> <rsid> \
> /sys/kernel/debug/s390/cio/crw_inject
Example:
A permanent error ERC on CHPID 0x60 would look like this:
echo 0 0 0 4 0 6 0x60 > /sys/kernel/debug/s390/cio/crw_inject
and an initialized ERC on the same CHPID:
echo 0 0 0 4 0 2 0x60 > /sys/kernel/debug/s390/cio/crw_inject
Signed-off-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Extract the handling of PEC 0x0304 into a function and make sure we only
attempt to disable the function if it is enabled. Also check for errors
returned by zpci_disable_device() and leave the function alone if there
are any.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When zpci_release_device() is called on a zPCI function that is still
configured it would not be deconfigured. Until now this hasn't caused
any problems because zpci_zdev_put() is only ever called for devices
in Standby or Reserved. Fix it by adding sclp_pci_deconfigure() to the
switch when in Configured state.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current zdev->state mixes the configuration states supported by CLP
with an additional Online state which is used inconsistently to include
enabled zPCI functions which are not yet visible to the common PCI
subsytem. In preparation for a clean separation between architected
configuration states and fine grained function states remove the Online
function state.
Where we previously checked for Online it is more accurate to check if
the function is enabled to avoid an edge case where a disabled device
was still treated as Online. This also simplifies checks whether
a function is configured as this is now directly reflected by its
function state.
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Let's export the prot_virt_guest and prot_virt_host variables into the
UV sysfs firmware interface to make them easily consumable by
administrators.
prot_virt_host being 1 indicates that we did the UV
initialization (opt-in)
prot_virt_guest being 1 indicates that the UV indicates the share and
unshare ultravisor calls which is an indication that we are running as
a protected guest.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Implement BPF_AND, BPF_OR and BPF_XOR as the existing BPF_ADD. Since
the corresponding machine instructions return the old value, BPF_FETCH
happens by itself, the only additional thing that is required is
zero-extension.
There is no single instruction that implements BPF_XCHG on s390, so use
a COMPARE AND SWAP loop.
BPF_CMPXCHG, on the other hand, can be implemented by a single COMPARE
AND SWAP. Zero-extension is automatically inserted by the verifier.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210304233002.149096-1-iii@linux.ibm.com
In commit 05bc1be6db ("s390/pci: create zPCI bus") we removed the
pci_dev_put() call matching the earlier pci_get_slot() done as part of
__zpci_event_availability(). This was based on the wrong understanding
that the device_put() done as part of pci_destroy_device() would counter
the pci_get_slot() when it only counters the initial reference. This
same understanding and existing bad example also lead to not doing
a pci_dev_put() in zpci_remove_device().
Since releasing the PCI devices, unlike releasing the PCI slot, does not
print any debug message for testing I added one in pci_release_dev().
This revealed that we are indeed leaking the PCI device on PCI
hotunplug. Further testing also revealed another missing pci_dev_put() in
disable_slot().
Fix this by adding the missing pci_dev_put() in disable_slot() and fix
zpci_remove_device() with the correct pci_dev_put() calls. Also instead
of calling pci_get_slot() in __zpci_event_availability() to determine if
a PCI device is registered and then doing the same again in
zpci_remove_device() do this once in zpci_remove_device() which makes
sure that the pdev in __zpci_event_availability() is only used for the
result of pci_scan_single_device() which does not need a reference count
decremnt as its ownership goes to the PCI bus.
Also move the check if zdev->zbus->bus is set into zpci_remove_device()
since it may be that we're removing a device with devfn != 0 which never
had a PCI bus. So we can still set the pdev->error_state to indicate
that the device is not usable anymore, add a flag to set the error state.
Fixes: 05bc1be6db ("s390/pci: create zPCI bus")
Cc: <stable@vger.kernel.org> # 5.8+: e1bff843cd s390/pci: remove superfluous zdev->zbus check
Cc: <stable@vger.kernel.org> # 5.8+: ba764dd703 s390/pci: refactor zpci_create_device()
Cc: <stable@vger.kernel.org> # 5.8+
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 152e9b8676 ("s390/vtime: steal time exponential moving average")
inadvertently changed the input value for account_steal_time() from
"cputime_to_nsecs(steal)" to just "steal", resulting in broken increased
steal time accounting.
Fix this by changing it back to "cputime_to_nsecs(steal)".
Fixes: 152e9b8676 ("s390/vtime: steal time exponential moving average")
Cc: <stable@vger.kernel.org> # 5.1
Reported-by: Sabine Forkel <sabine.forkel@de.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The following BUG message was triggered repeatedly when complete counter
sets are extracted from the CPUMF:
BUG: using smp_processor_id() in preemptible [00000000]
code: psvc-readsets/7759
caller is cf_diag_needspace+0x2c/0x100
CPU: 7 PID: 7759 Comm: psvc-readsets Not tainted 5.12.0
Hardware name: IBM 3906 M03 703 (LPAR)
Call Trace:
[<00000000c7043f78>] show_stack+0x90/0xf8
[<00000000c705776a>] dump_stack+0xba/0x108
[<00000000c705d91c>] check_preemption_disabled+0xec/0xf0
[<00000000c63eb1c4>] cf_diag_needspace+0x2c/0x100
[<00000000c63ecbcc>] cf_diag_ioctl_start+0x10c/0x240
[<00000000c63ece9a>] cf_diag_ioctl+0x19a/0x238
[<00000000c675f3f4>] __s390x_sys_ioctl+0xc4/0x100
[<00000000c63ca762>] do_syscall+0x82/0xd0
[<00000000c705bdd8>] __do_syscall+0xc0/0xd8
[<00000000c706d532>] system_call+0x72/0x98
2 locks held by psvc-readsets/7759:
#0: 00000000c75a57c0 (cpu_hotplug_lock){++++}-{0:0},
at: cf_diag_ioctl+0x44/0x238
#1: 00000000c75a3078 (cf_diag_ctrset_mutex){+.+.}-{3:3},
at: cf_diag_ioctl+0x54/0x238
This issue is a missing get_cpu_ptr/put_cpu_ptr pair in function
cf_diag_needspace. Add it.
Fixes: cf6acb8bdb ("s390/cpumf: Add support for complete counter set extraction")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently arch_stack_walk_reliable() is documented with an identical
comment in both x86 and S/390 implementations which is a bit redundant.
Move this to the header and convert to kerneldoc while we're at it.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20210309194125.652-1-broonie@kernel.org
Split kvm_s390_logical_to_effective to a generic function called
_kvm_s390_logical_to_effective. The new function takes a PSW and an address
and returns the address with the appropriate bits masked off. The old
function now calls the new function with the appropriate PSW from the vCPU.
This is needed to avoid code duplication for vSIE.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # for VSIE: correctly handle MVPG when in VSIE
Link: https://lore.kernel.org/r/20210302174443.514363-2-imbrenda@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
When we intercept a DIAG_9C from the guest we verify that the
target real CPU associated with the virtual CPU designated by
the guest is running and if not we forward the DIAG_9C to the
target real CPU.
To avoid a diag9c storm we allow a maximal rate of diag9c forwarding.
The rate is calculated as a count per second defined as a new
parameter of the s390 kvm module: diag9c_forwarding_hz .
The default value of 0 is to not forward diag9c.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/1613997661-22525-2-git-send-email-pmorel@linux.ibm.com
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Remove the 60 seconds read interval limit. Do not impose any limit
at all and allow read of complete counter sets.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The cpumask being checked in cpu_group_map() must have at least one
cpu set; therefore remove the check.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of unsigned long long, and use unsigned long instead
everywhere. The usage of unsigned long long is a leftover from
31 bit kernel support.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA4JRkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoWqD/9dbbqe8L701U6May1A/4hRsqL4THTA2flx
vNCNRBl6XV3l/wBCtL6waKy6tyO4lyM8XdUdEvo3Kxl2kGPb8eVfpyYL/+77HqyH
ctT4RMrs+84Mxn+5N6cM97hS1qVI2moTxxyvOEl/JTB7BYrutz9gvAoeY3/Dto47
J66oSaPeuqJ32TyihxfQHVxQopJcqFzDjyoYHGDu6ATio1PXfaIdTu8ywVYSECAh
pWI4rwnqdurGuHMNpxyL1bA6CT/jC7s+sqU7bUYUCgtYI3eG0u3V0bp5gAQQIgl9
5sxxE3DidYGAkYZsosrelshBtzGddLdz4Qrt2ungMYv8RsGNpFQ095jDPKDwFaZj
bSvSsfplCo7iFsJByb1TtpNEOW8eAwi81PmBDVQ9Oq5P5ygTYno9GBDc/20ql0Fk
q6wcX28coE3IBw44ne0hIwvBOtXV4WJyluG/gqOxfbTH+kOy3pDsN8lWcY/P4X0U
yzdU2MLHe8BNMyYlUiBF47Amzt4ltr85P4XD3WZ4bX71iwri6HvrdGWLuuKwX+Ie
66QiIDDQIYZQ6NMMJWS9DGW3y3DBizpSXGxONbOw1J2bQdNmtToR0D2UnK/9UnKp
msnvkUNk8fkYGS4aptpJ6HxbmjMEG5YtbiGlPj6fz5/7MTvhRjPxt7A0LWrUIdqR
f88+sHUMqg==
=oc8u
-----END PGP SIGNATURE-----
Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block
Pull io_uring thread rewrite from Jens Axboe:
"This converts the io-wq workers to be forked off the tasks in question
instead of being kernel threads that assume various bits of the
original task identity.
This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry
is always that we could be missing some pieces for file types doing
unusual things (recent /dev/tty example comes to mind, userfaultfd
reads installing file descriptors is another fun one... - both of
which need special handling, and I bet it's not the last weird oddity
we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of
that, it's also more efficient since we're not wasting space and code
on tracking state, or switching between different states.
I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual
regression suite to production. Any issue that may crop up should be
manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of
the little issues we've found over the last week have been related to
just changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and
5.11 stable branches as well.
That said, the changes outside of io_uring/io-wq are:
- arch setup, simple one-liner to each arch copy_thread()
implementation.
- Removal of net and proc restrictions for io_uring, they are no
longer needed or useful"
* tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits)
io-wq: remove now unused IO_WQ_BIT_ERROR
io_uring: fix SQPOLL thread handling over exec
io-wq: improve manager/worker handling over exec
io_uring: ensure SQPOLL startup is triggered before error shutdown
io-wq: make buffered file write hashed work map per-ctx
io-wq: fix race around io_worker grabbing
io-wq: fix races around manager/worker creation and task exit
io_uring: ensure io-wq context is always destroyed for tasks
arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
io_uring: cleanup ->user usage
io-wq: remove nr_process accounting
io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
net: remove cmsg restriction from io_uring based send/recvmsg calls
Revert "proc: don't allow async path resolution of /proc/self components"
Revert "proc: don't allow async path resolution of /proc/thread-self components"
io_uring: move SQPOLL thread io-wq forked worker
io-wq: make io_wq_fork_thread() available to other users
io-wq: only remove worker from free_list, if it was there
io_uring: remove io_identity
io_uring: remove any grabbing of context
...
- Fix physical vs virtual confusion in some basic mm macros and
routines. Caused by __pa == __va on s390 currently.
- Get rid of on-stack cpu masks.
- Add support for complete CPU counter set extraction.
- Add arch_irq_work_raise implementation.
- virtio-ccw revision and opcode fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmA5UHMACgkQjYWKoQLX
FBh/Vgf/ezaC/qx8cPAJKemWTST5tK/cc3g63L5SlvIsiKTTcO08+PYpGEo5ajU8
0DDpDZdP+DYeDgLatrFFj+MlQyIL4wd602uKiRqD1LwjTR1oA+6HDDQtE41v2Z2a
t2Kuv32dGVT5361RythnerTdMx18XG6k77JprP6b7zCFa2qCpc8DaKk7FvcqnQt+
gfebknC/hdL15IJhtpPrIoqqerDXxB+HU+dSouipQwiamtENGFOuJ5csgl9XIJuR
ILzq3Ad/6u8yKf77c+q9WRvu3DTIsXXSY8xwl3zJjNqXcAnN2sa4apc7+noNJ0YZ
UkbokSshQomfOYMhIxyLs9AKEEcltg==
=YVkj
-----END PGP SIGNATURE-----
Merge tag 's390-5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix physical vs virtual confusion in some basic mm macros and
routines. Caused by __pa == __va on s390 currently.
- Get rid of on-stack cpu masks.
- Add support for complete CPU counter set extraction.
- Add arch_irq_work_raise implementation.
- virtio-ccw revision and opcode fixes.
* tag 's390-5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpumf: Add support for complete counter set extraction
virtio/s390: implement virtio-ccw revision 2 correctly
s390/smp: implement arch_irq_work_raise()
s390/topology: move cpumasks away from stack
s390/smp: smp_emergency_stop() - move cpumask away from stack
s390/smp: __smp_rescan_cpus() - move cpumask away from stack
s390/smp: consolidate locking for smp_rescan()
s390/mm: fix phys vs virt confusion in vmem_*() functions family
s390/mm: fix phys vs virt confusion in pgtable allocation routines
s390/mm: fix invalid __pa() usage in pfn_pXd() macros
s390/mm: make pXd_deref() macros return a pointer
s390/opcodes: rename selhhhr to selfhr
This overrides arch_get_mappabble_range() on s390 platform which will be
used with recently added generic framework. It modifies the existing
range check in vmem_add_mapping() using arch_get_mappable_range(). It
also adds a VM_BUG_ON() check that would ensure that mhp_range_allowed()
has already been called on the hotplug path.
Link: https://lkml.kernel.org/r/1612149902-7867-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmA2xiQUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vzRDA/9GCyEskI9DMtyT9UeoTMzpHcUZpaU
eCbLa2BSPjOKlrHLnPY7IwE0nT7ihe4OOcm8uOYOWtulE46XJNCHfxlUYP3SbI0Y
JlG0FBCh4ldzCzzKsftwkSvVhk+gn+ms9ucJ8q2iBSOXVhG/41IbX7++8IfbQM4v
VHjdYUmTCCiOSRDtBVi82p4+GAHxH8IhaB0gDNb1Q7myj+qJKL5nKjK/nukgO0fO
UpCnSxyua48Ij+c59Y1QAIhGeORq5Gg5Q4ussY3FxS9ovhZODEGQwCFniTfilqRw
wEB9Fb8tiPY60ljEyDPnERMkiW69zutTJqOY4LfwmoRM9IEbxD6VPIqF5gin8sB7
pHhX4KUU+eB1hQdK9SGKjkwyehquNKzTdxsu2jccltOKwBm5jcXYeOvu2bJTzZn+
rrZPYJoA1dQig3bEuOzsBxvW4Jaj7IsVfVcao4OzXyh8Y7tLr9kVDXxr7JC/EkPM
zRK24yglERD2J1JXgNMvOuJQj6JmRHhEbV/faZci8x8ZEaz1FawRAUZqHf/gGmnW
2CllarHbRnchPyD8btv03Mp84WG6fCfKy7zG2D8HxOsiStDO/5ICehHtGcvYg7IL
RuE4Tj8OKdcbw/8cO4C3842FqiSj34+jooNIHSLyBqcpJam6VsN4XqNIZCL+DeG5
Q2JXruAaahTWOZg=
=GXL5
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Remove unnecessary locking around _OSC (Bjorn Helgaas)
- Clarify message about _OSC failure (Bjorn Helgaas)
- Remove notification of PCIe bandwidth changes (Bjorn Helgaas)
- Tidy checking of syscall user config accessors (Heiner Kallweit)
Resource management:
- Decline to resize resources if boot config must be preserved (Ard
Biesheuvel)
- Fix pci_register_io_range() memory leak (Geert Uytterhoeven)
Error handling (Keith Busch):
- Clear error status from the correct device
- Retain error recovery status so drivers can use it after reset
- Log the type of Port (Root or Switch Downstream) that we reset
- Always request a reset for Downstream Ports in frozen state
Endpoint framework and NTB (Kishon Vijay Abraham I):
- Make *_get_first_free_bar() take into account 64 bit BAR
- Add helper API to get the 'next' unreserved BAR
- Make *_free_bar() return error codes on failure
- Remove unused pci_epf_match_device()
- Add support to associate secondary EPC with EPF
- Add support in configfs to associate two EPCs with EPF
- Add pci_epc_ops to map MSI IRQ
- Add pci_epf_ops to expose function-specific attrs
- Allow user to create sub-directory of 'EPF Device' directory
- Implement ->msi_map_irq() ops for cadence
- Configure LM_EP_FUNC_CFG based on epc->function_num_map for cadence
- Add EP function driver to provide NTB functionality
- Add support for EPF PCI Non-Transparent Bridge
- Add specification for PCI NTB function device
- Add PCI endpoint NTB function user guide
- Add configfs binding documentation for pci-ntb endpoint function
Broadcom STB PCIe controller driver:
- Add support for BCM4908 and external PERST# signal controller
(Rafał Miłecki)
Cadence PCIe controller driver:
- Retrain Link to work around Gen2 training defect (Nadeem Athani)
- Fix merge botch in cdns_pcie_host_map_dma_ranges() (Krzysztof
Wilczyński)
Freescale Layerscape PCIe controller driver:
- Add LX2160A rev2 EP mode support (Hou Zhiqiang)
- Convert to builtin_platform_driver() (Michael Walle)
MediaTek PCIe controller driver:
- Fix OF node reference leak (Krzysztof Wilczyński)
Microchip PolarFlare PCIe controller driver:
- Add Microchip PolarFire PCIe controller driver (Daire McNamara)
Qualcomm PCIe controller driver:
- Use PHY_REFCLK_USE_PAD only for ipq8064 (Ansuel Smith)
- Add support for ddrss_sf_tbu clock for sm8250 (Dmitry Baryshkov)
Renesas R-Car PCIe controller driver:
- Drop PCIE_RCAR config option (Lad Prabhakar)
- Always allocate MSI addresses in 32bit space (Marek Vasut)
Rockchip PCIe controller driver:
- Add FriendlyARM NanoPi M4B DT binding (Chen-Yu Tsai)
- Make 'ep-gpios' DT property optional (Chen-Yu Tsai)
Synopsys DesignWare PCIe controller driver:
- Work around ECRC configuration hardware defect (Vidya Sagar)
- Drop support for config space in DT 'ranges' (Rob Herring)
- Change size to u64 for EP outbound iATU (Shradha Todi)
- Add upper limit address for outbound iATU (Shradha Todi)
- Make dw_pcie ops optional (Jisheng Zhang)
- Remove unnecessary dw_pcie_ops from al driver (Jisheng Zhang)
Xilinx Versal CPM PCIe controller driver:
- Fix OF node reference leak (Pan Bian)
Miscellaneous:
- Remove tango host controller driver (Arnd Bergmann)
- Remove IRQ handler & data together (altera-msi, brcmstb, dwc)
(Martin Kaiser)
- Fix xgene-msi race in installing chained IRQ handler (Martin
Kaiser)
- Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy (Junhao He)
- Fix pci-bridge-emul array overruns (Russell King)
- Remove obsolete uses of WARN_ON(in_interrupt()) (Sebastian Andrzej
Siewior)"
* tag 'pci-v5.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (69 commits)
PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
PCI: qcom: Add support for ddrss_sf_tbu clock
dt-bindings: PCI: qcom: Document ddrss_sf_tbu clock for sm8250
PCI: al: Remove useless dw_pcie_ops
PCI: dwc: Don't assume the ops in dw_pcie always exist
PCI: dwc: Add upper limit address for outbound iATU
PCI: dwc: Change size to u64 for EP outbound iATU
PCI: dwc: Drop support for config space in 'ranges'
PCI: layerscape: Convert to builtin_platform_driver()
PCI: layerscape: Add LX2160A rev2 EP mode support
dt-bindings: PCI: layerscape: Add LX2160A rev2 compatible strings
PCI: dwc: Work around ECRC configuration issue
PCI/portdrv: Report reset for frozen channel
PCI/AER: Specify the type of Port that was reset
PCI/ERR: Retain status from error notification
PCI/AER: Clear AER status from Root Port when resetting Downstream Port
PCI/ERR: Clear status of the reporting device
dt-bindings: arm: rockchip: Add FriendlyARM NanoPi M4B
PCI: rockchip: Make 'ep-gpios' DT property optional
Documentation: PCI: Add PCI endpoint NTB function user guide
...
The irq stack switching was moved out of the ASM entry code in course of
the entry code consolidation. It ended up being suboptimal in various
ways.
- Make the stack switching inline so the stackpointer manipulation is not
longer at an easy to find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- A objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused
about the stack pointer manipulation.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmA21OcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaX0D/9S0ud6oqbsIvI8LwhvYub63a2cjKP9
liHAJ7xwMYYVwzf0skwsPb/QE6+onCzdq0upJkgG/gEYm2KbiaMWZ4GgHdj0O7ER
qXKJONDd36AGxSEdaVzLY5kPuD/mkomGk5QdaZaTmjruthkNzg4y/N2wXUBIMZR0
FdpSpp5fGspSZCn/DXDx6FjClwpLI53VclvDs6DcZ2DIBA0K+F/cSLb1UQoDLE1U
hxGeuNa+GhKeeZ5C+q5giho1+ukbwtjMW9WnKHAVNiStjm0uzdqq7ERGi/REvkcB
LY62u5uOSW1zIBMmzUjDDQEqvypB0iFxFCpN8g9sieZjA0zkaUioRTQyR+YIQ8Cp
l8LLir0dVQivR1bHghHDKQJUpdw/4zvDj4mMH10XHqbcOtIxJDOJHC5D00ridsAz
OK0RlbAJBl9FTdLNfdVReBCoehYAO8oefeyMAG12nZeSh5XVUWl238rvzmzIYNhG
cEtkSx2wIUNEA+uSuI+xvfmwpxL7voTGvqmiRDCAFxyO7Bl/GBu9OEBFA1eOvHB+
+wTmPDMswRetQNh4QCRXzk1JzP1Wk5CobUL9iinCWFoTJmnsPPSOWlosN6ewaNXt
kYFpRLy5xt9EP7dlfgBSjiRlthDhTdMrFjD5bsy1vdm1w7HKUo82lHa4O8Hq3PHS
tinKICUqRsbjig==
=Sqr1
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq entry updates from Thomas Gleixner:
"The irq stack switching was moved out of the ASM entry code in course
of the entry code consolidation. It ended up being suboptimal in
various ways.
This reworks the X86 irq stack handling:
- Make the stack switching inline so the stackpointer manipulation is
not longer at an easy to find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- A objtool fix for CONFIG_FRAME_POINTER=y builds where it got
confused about the stack pointer manipulation"
* tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix stack-swizzle for FRAME_POINTER=y
um: Enforce the usage of asm-generic/softirq_stack.h
x86/softirq/64: Inline do_softirq_own_stack()
softirq: Move do_softirq_own_stack() to generic asm header
softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig
x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
x86/softirq: Remove indirection in do_softirq_own_stack()
x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall
x86/entry: Convert device interrupts to inline stack switching
x86/entry: Convert system vectors to irq stack macro
x86/irq: Provide macro for inlining irq stack switching
x86/apic: Split out spurious handling code
x86/irq/64: Adjust the per CPU irq stack pointer by 8
x86/irq: Sanitize irq stack tracking
x86/entry: Fix instrumentation annotation
Add support to the CPU Measurement counter facility device driver
to extract complete counter sets per CPU and per counter set from user
space. This includes a new device named /dev/hwctr and support
for the device driver functions open, close and ioctl. Other
functions are not supported.
The ioctl command supports 3 subcommands:
S390_HWCTR_START: enables counter sets on a list of CPUs.
S390_HWCTR_STOP: disables counter sets on a list of CPUs.
S390_HWCTR_READ: reads counter sets on a list of CPUs.
The ioctl(..., S390_HWCTR_READ, ...) is the only subcommand which
returns data. It requires member data_bytes to be positive and
indicates the maximum amount of data available to store counter set
data. The other ioctl() subcommands do not use this member and it
should be set to zero.
The S390_HWCTR_READ subcommand returns the following data:
The cpuset data is flattened using the following scheme, stored in member
data:
0x0 0x8 0xc 0x10 0x10 0x18 0x20 0x28 0xU-1
+---------+-----+---------+-----+---------+-----+-----+------+------+
| no_cpus | cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
+---------+-----+---------+-----+---------+-----+-----+------+------+
0xU 0xU+4 0xU+8 0xU+10 0xV-1
+-----+---------+-----+-----+------+------+
| set | no_cnts | cv1 | cv2 | .... | cv_n |
+-----+---------+-----+-----+------+------+
0xV 0xV+4 0xV+8 0xV+c
+-----+---------+-----+---------+-----+-----+------+------+
| cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
+-----+---------+-----+---------+-----+-----+------+------+
U and V denote arbitrary hexadezimal addresses.
The first integer represents the number of CPUs data was extracted
from. This is followed by CPU number and number of counter sets extracted.
Both are two integer values. This is followed by the set identifer
and number of counters extracted. Both are two integer values. This is
followed by the counter values, each element is eight bytes in size.
The S390_HWCTR_READ ioctl subcommand is also limited to one call per
minute. This ensures that an application does not read out the
counter sets too often and reduces the overall CPU performance.
The complete counter set extraction is an expensive operation.
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The immediate need to have this is to have bpf_send_signal() send the
signal ASAP instead of during the next hrtimer interrupt. However, it
should also improve irq_work_queue() latencies in general, as well as
get s390 out of the lame architectures list [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/irq_work.c?h=v5.11#n45
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make cpumasks static variables to avoid potential large stack
frames. There shouldn't be any concurrent callers since all current
callers are serialized with the cpu hotplug lock.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make "cpumask_t cpumask" a static variable to avoid a potential large
stack frame. Also protect against potential concurrent callers by
introducing a local lock.
Note: smp_emergency_stop() gets only called with irqs and machine
checks disabled, therefore a cpu local deadlock is not possible. For
concurrent callers the first cpu which enters the critical section
wins and will stop all other cpus.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Avoid a potentially large stack frame and overflow by making
"cpumask_t avail" a static variable. There is no concurrent
access due to the existing locking.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move locking to __smp_rescan() instead of duplicating it to all call sites.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Due to historical reasons vmem_*() functions misuse
or ignore the notion of physical vs virtual addresses
difference.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The physical address of page tables is passed around and
used as virtual address in various locations.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There is little sense in applying __pa() to a physical
address, but that what pfn_pXd() macros do.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This update fixes semantics of pXd_deref macros which
are expected to return a CPU-addressable pointer.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
Summary of modules changes for the 5.12 merge window:
- Retire EXPORT_UNUSED_SYMBOL() and EXPORT_SYMBOL_GPL_FUTURE(). These export
types were introduced between 2006 - 2008. All the of the unused symbols have
been long removed and gpl future symbols were converted to gpl quite a long
time ago, and I don't believe these export types have been used ever since.
So, I think it should be safe to retire those export types now. (Christoph Hellwig)
- Refactor and clean up some aged code cruft in the module loader (Christoph Hellwig)
- Build {,module_}kallsyms_on_each_symbol only when livepatching is enabled, as
it is the only caller (Christoph Hellwig)
- Unexport find_module() and module_mutex and fix the last module
callers to not rely on these anymore. Make module_mutex internal to
the module loader. (Christoph Hellwig)
- Harden ELF checks on module load and validate ELF structures before checking
the module signature (Frank van der Linden)
- Fix undefined symbol warning for clang (Fangrui Song)
- Fix smatch warning (Dan Carpenter)
Signed-off-by: Jessica Yu <jeyu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAmA0/KMQHGpleXVAa2Vy
bmVsLm9yZwAKCRDARX44zjvBcu0uD/4nmRp18EKAtdUZivsZHat0aEWGlkmrVueY
5huYw6iwM8b/wIAl3xwLki1Iv0/l0a83WXZhLG4ekl0/Nj8kgllA+jtBrZWpoLMH
CZusN5dS9YwwyD2vu3ak83ARcehcDEPeA9thvc3uRFGis6Hi4bt1rkzGdrzsgqR4
tybfN4qaQx4ZAKFxA8bnS58l7QTFwUzTxJfM6WWzl1Q+mLZDr/WP+loJ/f1/oFFg
ufN31KrqqFpdQY5UKq5P4H8FVq/eXE1Mwl8vo3HsnAj598fznyPUmA3D/j+N4GuR
sTGBVZ9CSehUj7uZRs+Qgg6Bd+y3o44N29BrdZWA6K3ieTeQQpA+VgPUNrDBjGhP
J/9Y4ms4PnuNEWWRaa73m9qsVqAsjh9+T2xp9PYn9uWLCM8BvQFtWcY7tw4/nB0/
INmyiP/tIRpwWkkBl47u1TPR09FzBBGDZjBiSn3lm3VX+zCYtHoma5jWyejG11cf
ybDrTsci9ANyHNP2zFQsUOQJkph78PIal0i3k4ODqGJvaC0iEIH3Xjv+0dmE14rq
kGRrG/HN6HhMZPjashudVUktyTZ63+PJpfFlQbcUzdvjQQIkzW0vrCHMWx9vD1xl
Na7vZLl4Nb03WSJp6saY6j2YSRKL0poGETzGqrsUAHEhpEOPHduaiCVlAr/EmeMk
p6SrWv8+UQ==
=T29Q
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
- Retire EXPORT_UNUSED_SYMBOL() and EXPORT_SYMBOL_GPL_FUTURE(). These
export types were introduced between 2006 - 2008. All the of the
unused symbols have been long removed and gpl future symbols were
converted to gpl quite a long time ago, and I don't believe these
export types have been used ever since. So, I think it should be safe
to retire those export types now (Christoph Hellwig)
- Refactor and clean up some aged code cruft in the module loader
(Christoph Hellwig)
- Build {,module_}kallsyms_on_each_symbol only when livepatching is
enabled, as it is the only caller (Christoph Hellwig)
- Unexport find_module() and module_mutex and fix the last module
callers to not rely on these anymore. Make module_mutex internal to
the module loader (Christoph Hellwig)
- Harden ELF checks on module load and validate ELF structures before
checking the module signature (Frank van der Linden)
- Fix undefined symbol warning for clang (Fangrui Song)
- Fix smatch warning (Dan Carpenter)
* tag 'modules-for-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: potential uninitialized return in module_kallsyms_on_each_symbol()
module: remove EXPORT_UNUSED_SYMBOL*
module: remove EXPORT_SYMBOL_GPL_FUTURE
module: move struct symsearch to module.c
module: pass struct find_symbol_args to find_symbol
module: merge each_symbol_section into find_symbol
module: remove each_symbol_in_section
module: mark module_mutex static
kallsyms: only build {,module_}kallsyms_on_each_symbol when required
kallsyms: refactor {,module_}kallsyms_on_each_symbol
module: use RCU to synchronize find_module
module: unexport find_module and module_mutex
drm: remove drm_fb_helper_modinit
powerpc/powernv: remove get_cxl_module
module: harden ELF info handling
module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols
PF_IO_WORKER are kernel threads too, but they aren't PF_KTHREAD in the
sense that we don't assign ->set_child_tid with our own structure. Just
ensure that every arch sets up the PF_IO_WORKER threads like kthreads
in the arch implementation of copy_thread().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Convert to using the generic entry infrastructure.
- Add vdso time namespace support.
- Switch s390 and alpha to 64-bit ino_t. As discussed here
lkml.kernel.org/r/YCV7QiyoweJwvN+m@osiris
- Get rid of expensive stck (store clock) usages where possible. Utilize
cpu alternatives to patch stckf when supported.
- Make tod_clock usage less error prone by converting it to a union and
rework code which is using it.
- Machine check handler fixes and cleanups.
- Drop couple of minor inline asm optimizations to fix clang build.
- Default configs changes notably to make libvirt happy.
- Various changes to rework and improve qdio code.
- Other small various fixes and improvements all over the code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmAyzcwACgkQjYWKoQLX
FBjjMwgAmeY3oMkj93bnUF/OnbYTJQ0ZHmlyeboKt7SnFyvNpOVGyRfl7+fPHsNu
+t9QZQk0f7fSxbcC04gz0ZMw1YbTjWihgZJsN6s+qtrRsv/kVqKr7kvhFrcs8uSZ
rLiwIRWGVAbprnJZWCNqaGpKkOM0wPYZ5W3Mtnoxe4nTM2LwSu2RWI8ibTGYLQPy
FybKos2hYOFBTGQdrxmg1zAvpE8DJg4qQNLhYvnmHd8Bw/FNBmoyhx8rS8z06NmS
dWMk7pfvQaslIIaFC3Yo7/sJVa/JJH33FlBonc+MSO8OZz5O6vG4bk9ZHq6DfHUH
V1I38xiBdYdSXDq8QqT3N9d+CtjeMQ==
=Lt/v
-----END PGP SIGNATURE-----
Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Convert to using the generic entry infrastructure.
- Add vdso time namespace support.
- Switch s390 and alpha to 64-bit ino_t. As discussed at
https://lore.kernel.org/linux-mm/YCV7QiyoweJwvN+m@osiris/
- Get rid of expensive stck (store clock) usages where possible.
Utilize cpu alternatives to patch stckf when supported.
- Make tod_clock usage less error prone by converting it to a union and
rework code which is using it.
- Machine check handler fixes and cleanups.
- Drop couple of minor inline asm optimizations to fix clang build.
- Default configs changes notably to make libvirt happy.
- Various changes to rework and improve qdio code.
- Other small various fixes and improvements all over the code.
* tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (68 commits)
s390/qdio: remove 'merge_pending' mechanism
s390/qdio: improve handling of PENDING buffers for QEBSM devices
s390/qdio: rework q->qdio_error indication
s390/qdio: inline qdio_kick_handler()
s390/time: remove get_tod_clock_ext()
s390/crypto: use store_tod_clock_ext()
s390/hypfs: use store_tod_clock_ext()
s390/debug: use union tod_clock
s390/kvm: use union tod_clock
s390/vdso: use union tod_clock
s390/time: convert tod_clock_base to union
s390/time: introduce new store_tod_clock_ext()
s390/time: rename store_tod_clock_ext() and use union tod_clock
s390/time: introduce union tod_clock
s390,alpha: switch to 64-bit ino_t
s390: split cleanup_sie
s390: use r13 in cleanup_sie as temp register
s390: fix kernel asce loading when sie is interrupted
s390: add stack for machine check handler
s390: use WRITE_ONCE when re-allocating async stack
...
- Support for userspace to emulate Xen hypercalls
- Raise the maximum number of user memslots
- Scalability improvements for the new MMU. Instead of the complex
"fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an
rwlock so that page faults are concurrent, but the code that can run
against page faults is limited. Right now only page faults take the
lock for reading; in the future this will be extended to some
cases of page table destruction. I hope to switch the default MMU
around 5.12-rc3 (some testing was delayed due to Chinese New Year).
- Cleanups for MAXPHYADDR checks
- Use static calls for vendor-specific callbacks
- On AMD, use VMLOAD/VMSAVE to save and restore host state
- Stop using deprecated jump label APIs
- Workaround for AMD erratum that made nested virtualization unreliable
- Support for LBR emulation in the guest
- Support for communicating bus lock vmexits to userspace
- Add support for SEV attestation command
- Miscellaneous cleanups
PPC:
- Support for second data watchpoint on POWER10
- Remove some complex workarounds for buggy early versions of POWER9
- Guest entry/exit fixes
ARM64
- Make the nVHE EL2 object relocatable
- Cleanups for concurrent translation faults hitting the same page
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Simplification of the early init hypercall handling
Non-KVM changes (with acks):
- Detection of contended rwlocks (implemented only for qrwlocks,
because KVM only needs it for x86)
- Allow __DISABLE_EXPORTS from assembly code
- Provide a saner follow_pfn replacements for modules
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs
Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF
PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4
5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf
P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR
eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ==
=uFZU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"x86:
- Support for userspace to emulate Xen hypercalls
- Raise the maximum number of user memslots
- Scalability improvements for the new MMU.
Instead of the complex "fast page fault" logic that is used in
mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
but the code that can run against page faults is limited. Right now
only page faults take the lock for reading; in the future this will
be extended to some cases of page table destruction. I hope to
switch the default MMU around 5.12-rc3 (some testing was delayed
due to Chinese New Year).
- Cleanups for MAXPHYADDR checks
- Use static calls for vendor-specific callbacks
- On AMD, use VMLOAD/VMSAVE to save and restore host state
- Stop using deprecated jump label APIs
- Workaround for AMD erratum that made nested virtualization
unreliable
- Support for LBR emulation in the guest
- Support for communicating bus lock vmexits to userspace
- Add support for SEV attestation command
- Miscellaneous cleanups
PPC:
- Support for second data watchpoint on POWER10
- Remove some complex workarounds for buggy early versions of POWER9
- Guest entry/exit fixes
ARM64:
- Make the nVHE EL2 object relocatable
- Cleanups for concurrent translation faults hitting the same page
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Simplification of the early init hypercall handling
Non-KVM changes (with acks):
- Detection of contended rwlocks (implemented only for qrwlocks,
because KVM only needs it for x86)
- Allow __DISABLE_EXPORTS from assembly code
- Provide a saner follow_pfn replacements for modules"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
KVM: selftests: Don't bother mapping GVA for Xen shinfo test
KVM: selftests: Fix hex vs. decimal snafu in Xen test
KVM: selftests: Fix size of memslots created by Xen tests
KVM: selftests: Ignore recently added Xen tests' build output
KVM: selftests: Add missing header file needed by xAPIC IPI tests
KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
locking/arch: Move qrwlock.h include after qspinlock.h
KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
KVM: PPC: remove unneeded semicolon
KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
KVM: PPC: Book3S HV: Fix radix guest SLB side channel
KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
...
The "oprofile" user-space tools don't use the kernel OPROFILE support any more,
and haven't in a long time. User-space has been converted to the perf
interfaces.
The dcookies stuff is only used by the oprofile code. Now that oprofile's
support is getting removed from the kernel, there is no need for dcookies as
well.
Remove kernel's old oprofile and dcookies support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJgJMEVAAoJENK5HDyugRIcL8YP/jkmXH5CZT80ntcqrJGWKcG7
lWbach7uNeQteht7B1ZPKvojxizTkmfrN2sClX0B2hbGkc5TiWUQ2ZSnvnfWDZ8+
z2qQcEB11G/ReL2vvRk1fJlWdAOyUfrPee/44AkemnLRv+Niw/8PqnGd87yDQGsK
qy5E1XXfbjUq6Y/uMiLOX3+21I6w6o2Q6I3NNXC93s0wS3awqnft8n0XBC7iAPBj
eowRJxpdRU2Vcuj8UOzzOI7gQlwdjwYImyLPbRy/V8NawC8a+FHrPrf5/GCYlVzl
7TGFBsDQSmzvrBChUfoGz1Rq/VZ1a357p5rhRqemfUrdkjW+vyzelnD8I1W/hb2o
SmBXoPoyl3+UkFHNyJI0mI7obaV+2PzyXMV0JIQUj+IiX/mfeFv0nF4XfZD2IkRt
6xhaYj775Zrx32iBdGZIvvLg5Gh9ZkZmR5vJ7Fi/EIZFe6Z+bZnPKUROnAgS/o0z
+UkSygOhgo/1XbqrzZVk1iweWeu+EUMbY4YQv2qVnFhpvsq4ieThcUGQpWcxGjjH
WP8O0n1yq1slsnpUtxhiTsm46ENajx9zZp6Iv6Ws+NM0RUqjND8BdF1co9WGD3LS
cnZMFBs4Bg/V1HICL/D4s6L7t1ofrEXIgJH1y3iF0HeECq03mU4CgA/qly9Aebqg
UxPF3oNlVOPlds9FzsU2
=I2Ac
-----END PGP SIGNATURE-----
Merge tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux
Pull oprofile and dcookies removal from Viresh Kumar:
"Remove oprofile and dcookies support
The 'oprofile' user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
The dcookies stuff is only used by the oprofile code. Now that
oprofile's support is getting removed from the kernel, there is no
need for dcookies as well.
Remove kernel's old oprofile and dcookies support"
* tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux:
fs: Remove dcookies support
drivers: Remove CONFIG_OPROFILE support
arch: xtensa: Remove CONFIG_OPROFILE support
arch: x86: Remove CONFIG_OPROFILE support
arch: sparc: Remove CONFIG_OPROFILE support
arch: sh: Remove CONFIG_OPROFILE support
arch: s390: Remove CONFIG_OPROFILE support
arch: powerpc: Remove oprofile
arch: powerpc: Stop building and using oprofile
arch: parisc: Remove CONFIG_OPROFILE support
arch: mips: Remove CONFIG_OPROFILE support
arch: microblaze: Remove CONFIG_OPROFILE support
arch: ia64: Remove rest of perfmon support
arch: ia64: Remove CONFIG_OPROFILE support
arch: hexagon: Don't select HAVE_OPROFILE
arch: arc: Remove CONFIG_OPROFILE support
arch: arm: Remove CONFIG_OPROFILE support
arch: alpha: Remove CONFIG_OPROFILE support
Pull ELF compat updates from Al Viro:
"Sanitizing ELF compat support, especially for triarch architectures:
- X32 handling cleaned up
- MIPS64 uses compat_binfmt_elf.c both for O32 and N32 now
- Kconfig side of things regularized
Eventually I hope to have compat_binfmt_elf.c killed, with both native
and compat built from fs/binfmt_elf.c, with -DELF_BITS={64,32} passed
by kbuild, but that's a separate story - not included here"
* 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
get rid of COMPAT_ELF_EXEC_PAGESIZE
compat_binfmt_elf: don't bother with undef of ELF_ARCH
Kconfig: regularize selection of CONFIG_BINFMT_ELF
mips compat: switch to compat_binfmt_elf.c
mips: don't bother with ELF_CORE_EFLAGS
mips compat: don't bother with ELF_ET_DYN_BASE
mips: KVM_GUEST makes no sense for 64bit builds...
mips: kill unused definitions in binfmt_elf[on]32.c
mips binfmt_elf*32.c: use elfcore-compat.h
x32: make X32, !IA32_EMULATION setups able to execute x32 binaries
[amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly
elf_prstatus: collect the common part (everything before pr_reg) into a struct
binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID
For non-QEBSM devices, get_buf_states() merges PENDING and EMPTY buffers
into a single group of finished buffers. To allow the upper-layer driver
to differentiate between the two states, qdio_check_pending() looks at
each buffer's state again and sets the sbal_state flag to
QDIO_OUTBUF_STATE_FLAG_PENDING accordingly.
So effectively we're spending overhead on _every_ Output Queue
inspection, just to avoid some additional TX completion calls in case
a group of buffers has completed with mixed EMPTY / PENDING state.
Given that PENDING buffers should rarely occur, this is a bad trade-off.
In particular so as the additional checks in get_buf_states() affect
_all_ device types (even those that don't use the PENDING state).
Rip it all out, and just report the PENDING completions separately as
we already do for QEBSM devices.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For QEBSM devices the 'merge_pending' mechanism in get_buf_states()
doesn't apply, and we can actually get SLSB_P_OUTPUT_PENDING returned.
So for this case propagating the PENDING state to the driver via the
queue's sbal_state doesn't make sense and creates unnecessary overhead.
Instead introduce a new QDIO_ERROR_* flag that gets passed to the
driver, and triggers the same processing as if the buffers were flagged
as QDIO_OUTBUF_STATE_FLAG_PENDING.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove get_tod_clock_ext() and the STORE_CLOCK_EXT_SIZE define. This
enforces all users of the existing low level functions to use a union
tod_clock.
This way there is now a compile time check for the correct time and
therefore also if the size of the argument matches what will be
written to by the STORE CLOCK EXTENDED instruction.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use store_tod_clock_ext() in order to be able to get rid
get_tod_clock_ext().
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use store_tod_clock_ext() instead of get_tod_clock_ext().
Unfortunately one usage has to be converted to a cast, since
otherwise a uapi header file would have to be changed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use union tod_clock and get rid of the kvm specific struct
kvm_s390_tod_clock_ext which apparently was introduced for the same
purpose.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Convert tod_clock_base to union tod_clock. This simplifies quite a bit
of code and also fixes a bug in read_persistent_clock64();
void read_persistent_clock64(struct timespec64 *ts)
{
__u64 delta;
delta = initial_leap_seconds + TOD_UNIX_EPOCH;
get_tod_clock_ext(clk);
*(__u64 *) &clk[1] -= delta;
if (*(__u64 *) &clk[1] > delta)
clk[0]--;
ext_to_timespec64(clk, ts);
}
Assume &clk[1] == 3 and delta == 2; then after the substraction the if
condition becomes true and the epoch part of the clock is decremented
by one because of an assumed overflow, even though there is none.
Fix this by using 128 bit arithmetics and let the compiler do the
right thing:
void read_persistent_clock64(struct timespec64 *ts)
{
union tod_clock clk;
u64 delta;
delta = initial_leap_seconds + TOD_UNIX_EPOCH;
store_tod_clock_ext(&clk);
clk.eitod -= delta;
ext_to_timespec64(&clk, ts);
}
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Introduce new store_tod_clock_ext() function, which is the same like
store_tod_clock_ext_cc() except that it doesn't return a condition
code.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename store_tod_clock_ext() to store_tod_clock_ext_cc() to reflect
that it returns a condition code and also use union tod_clock as
parameter.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Introduce union tod_clock which is supposed to be used to decode and
access various fields of the result of STORE CLOCK EXTENDED.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390 and alpha are the only 64 bit architectures with a 32-bit ino_t.
Since this is quite unusual this causes bugs from time to time.
See e.g. commit ebce3eb2f7 ("ceph: fix inode number handling on
arches with 32-bit ino_t") for an example.
This (obviously) also prevents s390 and alpha to use 64-bit ino_t for
tmpfs. See commit b85a7a8bb5 ("tmpfs: disallow CONFIG_TMPFS_INODE64
on s390").
Therefore switch both s390 and alpha to 64-bit ino_t. This should only
have an effect on the ustat system call. To prevent ABI breakage
define struct ustat compatible to the old layout and change
sys_ustat() accordingly.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current code uses the address in %r11 to figure out whether
it was called from the machine check handler or from a normal
interrupt handler. Instead of doing this implicit logic (which
is mostly a leftover from the old critical cleanup approach)
just add a second label and use that.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Instead of thrashing r11 which is normally our pointer to struct
pt_regs on the stack, use r13 as temporary register in the BR_EX
macro. r13 is already used in cleanup_sie, so no need to thrash
another register.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
If a machine check is coming in during sie, the PU saves the
control registers to the machine check save area. Afterwards
mcck_int_handler is called, which loads __LC_KERNEL_ASCE into
%cr1. Later the code restores %cr1 from the machine check area,
but that is wrong when SIE was interrupted because the machine
check area still contains the gmap asce. Instead it should return
with either __KERNEL_ASCE in %cr1 when interrupted in SIE or
the previous %cr1 content saved in the machine check save area.
Fixes: 87d5986345 ("s390/mm: remove set_fs / rework address space handling")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@kernel.org> # v5.8+
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The previous code used the normal kernel stack for machine checks.
This is problematic when a machine check interrupts a system call
or interrupt handler right at the beginning where registers are set up.
Assume system_call is interrupted at the first instruction and a machine
check is triggered. The machine check handler is called, checks the PSW
to see whether it is coming from user space, notices that it is already
in kernel mode but %r15 still contains the user space stack. This would
lead to a kernel crash.
There are basically two ways of fixing that: Either using the 'critical
cleanup' approach which compares the address in the PSW to see whether
it is already at a point where the stack has been set up, or use an extra
stack for the machine check handler.
For simplicity, we will go with the second approach and allocate an extra
stack. This adds some memory overhead for large systems, but usually large
system have plenty of memory so this isn't really a concern. But it keeps
the mchk stack setup simple and less error prone.
Fixes: 0b0ed657fe ("s390: remove critical section cleanup from entry.S")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@kernel.org> # v5.8+
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The code does:
S390_lowcore.async_stack = new + STACK_INIT_OFFSET;
But the compiler is free to first assign one value and
add the other value later. If a IRQ would be coming in
between these two operations, it would run with an invalid
stack. Prevent this by using WRITE_ONCE.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This is a preparation patch for two later bugfixes. In the past both
int_handler and machine check handler used SWITCH_KERNEL to switch to
the kernel stack. However, SWITCH_KERNEL doesn't work properly in machine
check context. So instead of adding more complexity to this macro, just
remove it.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@kernel.org> # v5.8+
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Merge in the recent paravirt changes to resolve conflicts caused
by objtool annotations.
Conflicts:
arch/x86/xen/xen-asm.S
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To avoid include recursion hell move the do_softirq_own_stack() related
content into a generic asm header and include it from all places in arch/
which need the prototype.
This allows architectures to provide an inline implementation of
do_softirq_own_stack() without introducing a lot of #ifdeffery all over the
place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.289960691@linutronix.de
To prepare for inlining do_softirq_own_stack() replace
__ARCH_HAS_DO_SOFTIRQ with a Kconfig switch and select it in the affected
architectures.
This allows in the next step to move the function prototype and the inline
stub into a seperate asm-generic header file which is required to avoid
include recursion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.181713427@linutronix.de
Use a cpu alternative to switch between stck and stckf instead of
making it compile time dependent. This will also make kernels compiled
for old machines, but running on newer machines, use stckf.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add support for alternative inline assemblies with input and output
arguments. This is consistent to x86.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use a cpu alternative to switch between stck and stckf instead of
making it compile time dependent. This will also make kernels compiled
for old machines, but running on newer machines, use stckf.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use STORE CLOCK EXTENDED instead of STORE CLOCK in early tod clock
setup. This is just to remove another usage of stck, trying to remove
all usages of STORE CLOCK. This doesn't fix anything.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use get_tod_clock_fast() instead of store_tod_clock(), since
store_tod_clock() can be very slow.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The stck/stckf instruction used within the inline assembly within
do_account_vtime() changes the condition code. This is not reflected
with the clobber list, and therefore might result in incorrect code
generation.
It seems unlikely that the compiler could generate incorrect code
considering the surrounding C code, but it must still be fixed.
Cc: <stable@vger.kernel.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This is the s390 variant of commit e6b28ec65b ("x86/vdso: On timens
page fault prefault also VVAR page").
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Implement generic vdso time namespace support which also enables time
namespaces for s390. This is quite similar to what arm64 has.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use the passed in vdso_data pointer instead of calculating it again.
This is also required as a prerequisite for vdso time namespaces: if a
process is part of a time namespace __arch_get_vdso_data() will return
a pointer to the time namespace data page instead of the vdso data
page, which is not what __arch_get_hw_counter() expects.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For consistency with x86 and arm64 move the data page before code
pages. Similar to commit 601255ae3c ("arm64: vdso: move data page
before code pages").
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add a separate "[vvar]" mapping for the vdso datapage, since it
doesn't need to be executable or COW-able.
This is actually the s390 implementation of commit 8715493852
("arm64: vdso: put vdso datapage in a separate vma")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Implement vdso mapping similar to arm64 and powerpc.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
A few local variables exist only so the contents of a global variable
can be copied to them, and use that value only for reading.
Just remove them and rename some global variables. Also change
vdso64_[start|end] to be character arrays to be consistent with other
architectures, and get rid of the global variable vdso64_kbase.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Handle allocation error gracefully and simply disable vdso instead of
leaving the system in an undefined state.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The vdso is (and must) be page aligned and its size must also be
a multiple of PAGE_SIZE. Therefore no need to round upwards.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Convert vdso_init() to arch_initcall like it is on all other architectures.
This requires to remove the vdso_getcpu_init() call from vdso_init()
since it must be called before smp is enabled.
vdso_getcpu_init() is now an early_initcall like on powerpc.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The vdso data page actually contains an array. Fix that.
This doesn't fix a real bug, just reflects reality.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since Fedora 33 the virtualization stack of Fedora requires a couple of
netfilter modules to function properly. Let's add these to defconfig and
debug_defconfig.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Tested-by: Bjoern Walk <bwalk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
...but set it to off by default. Use the kernel command line option
`kmemleak=on` to enable it.
Signed-off-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add missing forward declaration for task_struct.
The warning appears when the -Werror C compiler flag is being used.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Disable CONFIG_TMPFS_INODE64 which is currently broken on s390x
because size of ino_t on s390x is 4 bytes.
This fixes the following error with kdump:
[ 9.415082] [608]: Remounting '/' read-only in with options 'size=238372k,nr_inodes=59593,inode64'.
[ 9.415093] rootfs: Cannot use inode64 with <64bit inums in kernel
[ 9.415093]
[ 9.415100] [608]: Failed to remount '/' read-only: Invalid argument
Fixes: 5c60ed283e ("s390: update defconfigs")
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Fix the following coccicheck warnings:
./arch/s390/include/asm/scsw.h:528:48-50: WARNING !A || A && B is
equivalent to !A || B.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove a superfluous semicolon after function definition.
Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Message-Id: <20210125095839.1720265-1-cy.fan@huawei.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently zpci_create_device() is only called in clp_add_pci_device()
which allocates the memory for the struct zpci_dev being created. There
is little separation of concerns as only both functions together can
create a zpci_dev and the only CLP specific code in
clp_add_pci_device() is a call to clp_query_pci_fn().
Improve this by removing clp_add_pci_device() and refactor
zpci_create_device() such that it alone creates and initializes the
zpci_dev given the FID and Function Handle. For this we need to make
clp_query_pci_fn() non-static. While at it remove the function handle
parameter since we can just take that from the zpci_dev. Also move
adding to the zpci_list to after the zdev has been fully created which
eliminates a window where a partially initialized zdev can be found by
get_zdev_by_fid().
Acked-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Both qeth and zfcp have fully moved to the polling-driven flow for
Input Queues with commit 0a6e634535 ("s390/qdio: extend polling
support to multiple queues") and commit 0b524abc2d ("scsi: zfcp: Lift
Input Queue tasklet from qdio").
So remove the tasklet code for Input Queues, streamline the IRQ handlers
and push the tasklet struct into struct qdio_output_q.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Current KVM_USER_MEM_SLOTS limits are arch specific (512 on Power, 509 on x86,
32 on s390, 16 on MIPS) but they don't really need to be. Memory slots are
allocated dynamically in KVM when added so the only real limitation is
'id_to_index' array which is 'short'. We don't have any other
KVM_MEM_SLOTS_NUM/KVM_USER_MEM_SLOTS-sized statically defined structures.
Low KVM_USER_MEM_SLOTS can be a limiting factor for some configurations.
In particular, when QEMU tries to start a Windows guest with Hyper-V SynIC
enabled and e.g. 256 vCPUs the limit is hit as SynIC requires two pages per
vCPU and the guest is free to pick any GFN for each of them, this fragments
memslots as QEMU wants to have a separate memslot for each of these pages
(which are supposed to act as 'overlay' pages).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210127175731.2020089-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the follow_pfn function is exported for modules but
follow_pte is not. However, follow_pfn is very easy to misuse,
because it does not provide protections (so most of its callers
assume the page is writable!) and because it returns after having
already unlocked the page table lock.
Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments. The older version
survives as follow_invalidate_pte() for use by fs/dax.c.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EXPORT_UNUSED_SYMBOL* is not actually used anywhere. Remove the
unused functionality as we generally just remove unused code anyway.
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>