Reorganise TLBCAM allocation so that when STRICT_KERNEL_RWX is
enabled, TLBCAMs are allocated such that readonly memory uses
different TLBCAMs.
This results in an allocation looking like:
Memory CAM mapping: 4/4/4/1/1/1/1/16/16/16/64/64/64/256/256 Mb, residual: 256Mb
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8ca169bc288261a0e0558712f979023c3a960ebb.1634292136.git.christophe.leroy@csgroup.eu
Avoid switching to AS1 when reloading TLBCAM after init for
STRICT_KERNEL_RWX.
When we setup AS1 we expect the entire accessible memory to be mapped
through one entry, this is not the case anymore at the end of init.
We are not changing the size of TLBCAMs, only flags, so no need to
switch to AS1.
So change loadcam_multi() to not switch to AS1 when the given
temporary tlb entry in 0.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a9d517fbfbc940f56103c46b323f6eb8f4485571.1634292136.git.christophe.leroy@csgroup.eu
Don't force MAS3_SX and MAS3_UX at all time. Take into account the
exec flag.
While at it, fix a couple of closeby style problems (indent with space
and unnecessary parenthesis), it keeps more readability.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5467044e59f27f9fcf709b9661779e3ce5f784f6.1634292136.git.christophe.leroy@csgroup.eu
We have a myriad of CONFIG symbols around different variants
of BOOKEs, which would be worth tidying up one day.
But at least, make file names and CONFIG option match:
We have CONFIG_FSL_BOOKE and CONFIG_PPC_FSL_BOOK3E.
fsl_booke.c is selected by and only by CONFIG_PPC_FSL_BOOK3E.
So rename it fsl_book3e to reduce confusion.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5dc871db1f67739319bec11f049ca450da1c13a2.1634292136.git.christophe.leroy@csgroup.eu
The page_alloc.c code will call into __kernel_map_pages() when
DEBUG_PAGEALLOC is configured and enabled.
As the implementation assumes hash, this should crash spectacularly if
not for a bit of luck in __kernel_map_pages(). In this function
linear_map_hash_count is always zero, the for loop exits without doing
any damage.
There are no other platforms that determine if they support
debug_pagealloc at runtime. Instead of adding code to mm/page_alloc.c to
do that, this change turns the map/unmap into a noop when in radix
mode and prints a warning once.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Reformat if per Christophe's suggestion]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211013213438.675095-1-joel@jms.id.au
max_mapnr is used by virt_addr_valid() to check if a linear
address is valid.
It must only include lowmem PFNs, like other architectures.
Problem detected on a system with 1G mem (Only 768M are mapped), with
CONFIG_DEBUG_VIRTUAL and CONFIG_TEST_DEBUG_VIRTUAL, it didn't report
virt_to_phys(VMALLOC_START), VMALLOC_START being 0xf1000000.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/77d99037782ac4b3c3b0124fc4ae80ce7b760b05.1634035228.git.christophe.leroy@csgroup.eu
Commit 8e11d62e2e ("powerpc/mem: Add back missing header to fix 'no
previous prototype' error") was supposed to fix the problem, but in
the meantime commit a927bd6ba9 ("mm: fix phys_to_target_node() and*
memory_add_physaddr_to_nid() exports") moved create_section_mapping()
prototype from asm/sparsemem.h to asm/mmzone.h
Fixes: 8e11d62e2e ("powerpc/mem: Add back missing header to fix 'no previous prototype' error")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/025754fde3d027904ae9d0191f395890bec93369.1631541649.git.christophe.leroy@csgroup.eu
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
- Convert pseries & powernv to use MSI IRQ domains.
- Rework the pseries CPU numbering so that CPUs that are removed, and later re-added, are
given a CPU number on the same node as previously, when possible.
- Add support for a new more flexible device-tree format for specifying NUMA distances.
- Convert powerpc to GENERIC_PTDUMP.
- Retire sbc8548 and sbc8641d board support.
- Various other small features and fixes.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard, Cédric Le Goater,
Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas, Fangrui Song, Finn Thain, Gautham R.
Shenoy, Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo
Bras, Lukas Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan Chancellor,
Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R. Sampat, Randy Dunlap, Sebastian
Andrzej Siewior, Srikar Dronamraju, Wan Jiabing, Xiongwei Song, Zheng Yongjun.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmEyHTYTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDo3D/9aXMVP2wsEMNB0XhTiJ1UUdi311Uq9
PvkAaGZH14ZqZLVigeiD3gt6YzTH0cEuGj6qgwsJrPDjF8FESnMbBsprMLr5/qE1
itWRGMAMCFaeTcB9ogYVJkzwg6RN2ZgIqoq4NVswNSXoAQGWb+1bvXq3RnXXNuGR
TQmLL02poNC6nX0YbRaQoT1Xx4nfUTiKHhU+Aok9uOCMJIyYZVATR6Qafb7/j7tO
UvjwOHztbu84lcJOGmSnw4LcmwNORLuP9IwR0r+O1M3ijEZqDo9TPkvtSz8HZwjU
mxdJwhrUmN0euMcghuiFxW+1XG2eM49ugsdJugiezG2RaIijbIp0nAIvdeaKAgT1
OSSwvWCQ0fkTPyLXE+O6tVqMhlUMdqQlRcyNwmN9svIip9VnwGNq3vA4ePlJm6Fi
i0i/tLqVNlJwFokZ7blW5g8SRgGRuFfXd5XUYLFvy5Teez+/7b1mW95gPQZSJ8kV
Tbx2e0nHAPX4hCAxJ1AB3/zTlnjY+4+WJ9bD5XdgXkeVE8PPh1BEkulhMi1R1OMj
57D1W6OgsBu/Pze78wjAvwO8+NAb1T/2mv2Bd/LY6Q+7hNDqOOhuajyBTxbH41FG
sqx5bKjKOwgTybfV9A0Eo0e4FQBX07yXltBFHaPlyA4sOsIhM59+PxNrEwN1eZrQ
LVVsdBXg8pHxrw==
=EbN0
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Convert pseries & powernv to use MSI IRQ domains.
- Rework the pseries CPU numbering so that CPUs that are removed, and
later re-added, are given a CPU number on the same node as
previously, when possible.
- Add support for a new more flexible device-tree format for specifying
NUMA distances.
- Convert powerpc to GENERIC_PTDUMP.
- Retire sbc8548 and sbc8641d board support.
- Various other small features and fixes.
Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard,
Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas,
Fangrui Song, Finn Thain, Gautham R. Shenoy, Hari Bathini, Joel
Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas
Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan
Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R.
Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan
Jiabing, Xiongwei Song, and Zheng Yongjun.
* tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (154 commits)
powerpc/bug: Cast to unsigned long before passing to inline asm
powerpc/ptdump: Fix generic ptdump for 64-bit
KVM: PPC: Fix clearing never mapped TCEs in realmode
powerpc/pseries/iommu: Rename "direct window" to "dma window"
powerpc/pseries/iommu: Make use of DDW for indirect mapping
powerpc/pseries/iommu: Find existing DDW with given property name
powerpc/pseries/iommu: Update remove_dma_window() to accept property name
powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
powerpc/pseries/iommu: Allow DDW windows starting at 0x00
powerpc/pseries/iommu: Add ddw_list_new_entry() helper
powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
powerpc/pseries/iommu: Replace hard-coded page shift
powerpc/numa: Update cpu_cpu_map on CPU online/offline
powerpc/numa: Print debug statements only when required
powerpc/numa: convert printk to pr_xxx
powerpc/numa: Drop dbg in favour of pr_debug
powerpc/smp: Enable CACHE domain for shared processor
powerpc/smp: Update cpu_core_map on all PowerPc systems
...
Merge our fixes branch into next.
That lets us resolve a conflict in arch/powerpc/sysdev/xive/common.c.
Between cbc06f051c ("powerpc/xive: Do not skip CPU-less nodes when
creating the IPIs"), which moved request_irq() out of xive_init_ipis(),
and 17df41fec5 ("powerpc: use IRQF_NO_DEBUG for IPIs") which added
IRQF_NO_DEBUG to that request_irq() call, which has now moved.
Since the conversion to generic ptdump we see crashes on 64-bit:
BUG: Unable to handle kernel data access on read at 0xc0eeff7f00000000
Faulting instruction address: 0xc00000000045e5fc
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP __walk_page_range+0x2bc/0xce0
LR __walk_page_range+0x240/0xce0
Call Trace:
__walk_page_range+0x240/0xce0 (unreliable)
walk_page_range_novma+0x74/0xb0
ptdump_walk_pgd+0x98/0x170
ptdump_check_wx+0x88/0xd0
mark_rodata_ro+0x48/0x80
kernel_init+0x74/0x1a0
ret_from_kernel_thread+0x5c/0x64
What's happening is that have walked off the end of the kernel page
tables, and started dereferencing junk values.
That happens because we initialised the ptdump_range to span all the way
up to 0xffffffffffffffff:
static struct ptdump_range ptdump_range[] __ro_after_init = {
{TASK_SIZE_MAX, ~0UL},
But the kernel page tables don't span that far. So on 64-bit set the end
of the range to be the address immediately past the end of the kernel
page tables, to limit the page table walk to valid addresses.
Fixes: e084728393 ("powerpc/ptdump: Convert powerpc to GENERIC_PTDUMP")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210831135151.886620-1-mpe@ellerman.id.au
Currently, a debug message gets printed every time an attempt to
add(remove) a CPU. However this is redundant if the CPU is already added
(removed) from the node.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-4-srikar@linux.vnet.ibm.com
Convert the remaining printk to pr_xxx
One advantage would be all prints will now have prefix "numa:" from
pr_fmt().
[ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Rebase onto powerpc/next, s/WARNING/Warning/]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-3-srikar@linux.vnet.ibm.com
powerpc supported numa=debug which is not documented. This option was
used to print early debug output. However something more flexible can be
achieved by using CONFIG_DYNAMIC_DEBUG.
Hence drop dbg (and numa=debug) in favour of pr_debug
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Rebase on to powerpc/next form2 affinity changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100521.412639-2-srikar@linux.vnet.ibm.com
- Fix random crashes on some 32-bit CPUs by adding isync() after locking/unlocking KUEP
- Fix intermittent crashes when loading modules with strict module RWX
- Fix a section mismatch introduce by a previous fix.
Thanks to: Christophe Leroy, Fabiano Rosas, Laurent Vivier, Murilo Opsfelder Araújo,
Nathan Chancellor, Stan Johnson.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmEhjVUTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgM/VD/4o5D9d2Xppt+T0zdnKomq+3ffkC33/
zBK4vqOVOXbnlRpChqIqsHB3LNxMNMvTVaoLvxgy3ZQ57+rnirSDaFOaj4Nbazdx
STwWmyxW9xPshqvj8tz8uHadSkvbrCClFy59FXtJf4H/iztTnQORKnXI9r3wxXS+
wBhw8Nhquuqg4O5h4q6yLLRIAaskus7uymDzYHVZkHO0RhfPLEZJwfCxydc29ukK
wIRB6qojFBbWm/UscY1w6FiYrBn4Y5F3DzoTzJ7xlO6l1NYaE+58aun/oTGU7922
/8fXYs34TnkF6sA9qGhOtOc1MXfH8meFoH9s/fY3Z3O88xTe8k15wo2Ujlk/u0X1
1Gzv9FZI0RnpPtSLPiiu72/zS/vFOxAVCFMTvcodlte9RN90fW5Qwt/O1ya22vWt
Ea3O9iNmYgQ+lV7ZZYDtKQ22WHIublg6cY5d3NDyj5HrzN/vGyp3QJFb2dnWoEpx
k/KkK16oiIlduLGiFoYjn1ELyHUBTvp483y7zmspA4fCb0ue6W8b2zt8FszH0hI7
N4uroGXuk9OyhNsLWR8UHUR0s6Gi0XSaQ0O4XgWfoDAAvdev4oZiCqw0q5552OvX
eE/Ogxc7INCiaoeLwOhYhCKjr+jBP8fhqyQzquyqMgUqEbxLtcFZCJ09bpXHjjiH
OlAvZwlzOhwcKg==
=K2B0
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix random crashes on some 32-bit CPUs by adding isync() after
locking/unlocking KUEP
- Fix intermittent crashes when loading modules with strict module RWX
- Fix a section mismatch introduce by a previous fix.
Thanks to Christophe Leroy, Fabiano Rosas, Laurent Vivier, Murilo
Opsfelder Araújo, Nathan Chancellor, and Stan Johnson.
h# -----BEGIN PGP SIGNATURE-----
* tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix set_memory_*() against concurrent accesses
powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
powerpc/xive: Do not mark xive_request_ipi() as __init
Laurent reported that STRICT_MODULE_RWX was causing intermittent crashes
on one of his systems:
kernel tried to execute exec-protected page (c008000004073278) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel instruction fetch
Faulting instruction address: 0xc008000004073278
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks ...
CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
Workqueue: events control_work_handler [virtio_console]
NIP: c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
REGS: c00000002e4ef7e0 TRAP: 0400 Not tainted (5.14.0-rc4+)
MSR: 800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 200400cf
...
NIP fill_queue+0xf0/0x210 [virtio_console]
LR fill_queue+0xf0/0x210 [virtio_console]
Call Trace:
fill_queue+0xb4/0x210 [virtio_console] (unreliable)
add_port+0x1a8/0x470 [virtio_console]
control_work_handler+0xbc/0x1e8 [virtio_console]
process_one_work+0x290/0x590
worker_thread+0x88/0x620
kthread+0x194/0x1a0
ret_from_kernel_thread+0x5c/0x64
Jordan, Fabiano & Murilo were able to reproduce and identify that the
problem is caused by the call to module_enable_ro() in do_init_module(),
which happens after the module's init function has already been called.
Our current implementation of change_page_attr() is not safe against
concurrent accesses, because it invalidates the PTE before flushing the
TLB and then installing the new PTE. That leaves a window in time where
there is no valid PTE for the page, if another CPU tries to access the
page at that time we see something like the fault above.
We can't simply switch to set_pte_at()/flush TLB, because our hash MMU
code doesn't handle a set_pte_at() of a valid PTE. See [1].
But we do have pte_update(), which replaces the old PTE with the new,
meaning there's no window where the PTE is invalid. And the hash MMU
version hash__pte_update() deals with synchronising the hash page table
correctly.
[1]: https://lore.kernel.org/linuxppc-dev/87y318wp9r.fsf@linux.ibm.com/
Fixes: 1f9ad21c3b ("powerpc/mm: Implement set_memory() routines")
Reported-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Murilo Opsfelder Araújo <muriloo@linux.ibm.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210818120518.3603172-1-mpe@ellerman.id.au
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-6-aneesh.kumar@linux.ibm.com
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-5-aneesh.kumar@linux.ibm.com
The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.
Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.
Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-4-aneesh.kumar@linux.ibm.com
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-3-aneesh.kumar@linux.ibm.com
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename powerpc
specific powerpc_debugfs_root to arch_debugfs_dir.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132831.233794-2-aneesh.kumar@linux.ibm.com
Similar to x86/s390 add a debugfs file to tune tlb_single_page_flush_ceiling.
Also add a debugfs entry for tlb_local_single_page_flush_ceiling.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132831.233794-1-aneesh.kumar@linux.ibm.com
After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor in the case the NUMA topology of the LPAR's
memory is updated.
This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.
If later a memory block is added or removed, drmem_update_dt() is called
and it is overwriting the DT node ibm,dynamic-reconfiguration-memory to
match the added or removed LMB. But the LMB's associativity node has not
been updated after the DT node update and thus the node is overwritten by
the Linux's topology instead of the hypervisor one.
Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt().
Because, in that case, the LMB tree has been used to set the DT property
and thus it doesn't need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210517090606.56930-1-ldufour@linux.ibm.com
When a LPAR is migratable, we should consider the maximum possible NUMA
node instead of the number of NUMA nodes from the actual system.
The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the
LPAR is being migrated on another box, it may see up to the nodes
defined by 'ibm,max-associativity-domains'. So if a LPAR is migratable,
that value should be used.
Unfortunately, there is no easy way to know if an LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' in
the case it set to migrate partition, but that would not mean that the
current partition is migratable.
Without this patch, when a LPAR is started on a 2 node box and then
migrated to a 3 node box, the hypervisor may spread the LPAR's CPUs on
the 3rd node. In that case if a CPU from that 3rd node is added to the
LPAR, it will be wrongly assigned to the node because the kernel has
been set to use up to 2 nodes (the configuration of the departure node).
With this patch applies, the CPU is correctly added to the 3rd node.
Fixes: f9f130ff2e ("powerpc/numa: Detect support for coregroup")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210511073136.17795-1-ldufour@linux.ibm.com
This reverts commit c742199a01.
c742199a01 ("mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge")
breaks arm64 in at least two ways for configurations where PUD or PMD
folding occur:
1. We no longer install huge-vmap mappings and silently fall back to
page-granular entries, despite being able to install block entries
at what is effectively the PGD level.
2. If the linear map is backed with block mappings, these will now
silently fail to be created in alloc_init_pud(), causing a panic
early during boot.
The pgtable selftests caught this, although a fix has not been
forthcoming and Christophe is AWOL at the moment, so just revert the
change for now to get a working -rc3 on which we can queue patches for
5.15.
A simple revert breaks the build for 32-bit PowerPC 8xx machines, which
rely on the default function definitions when the corresponding
page-table levels are folded, since commit a6a8f7c4aa ("powerpc/8xx:
add support for huge pages on VMAP and VMALLOC"), eg:
powerpc64-linux-ld: mm/vmalloc.o: in function `vunmap_pud_range':
linux/mm/vmalloc.c:362: undefined reference to `pud_clear_huge'
To avoid that, add stubs for pud_clear_huge() and pmd_clear_huge() in
arch/powerpc/mm/nohash/8xx.c as suggested by Christophe.
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: c742199a01 ("mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge")
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
[mpe: Fold in 8xx.c changes from Christophe and mention in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/linux-arm-kernel/CAMuHMdXShORDox-xxaeUfDW3wx2PeggFSqhVSHVZNKCGK-y_vQ@mail.gmail.com/
Link: https://lore.kernel.org/r/20210717160118.9855-1-jonathan@marek.ca
Link: https://lore.kernel.org/r/87r1fs1762.fsf@mpe.ellerman.id.au
Signed-off-by: Will Deacon <will@kernel.org>
Fix crashes on 64-bit Book3E due to use of Book3S only mtmsrd instruction.
Fix "scheduling while atomic" warnings at boot due to preempt count underflow.
Two commits fixing our handling of BPF atomic instructions.
Fix error handling in xive when allocating an IPI.
Fix lockup on kernel exec fault on 603.
Thanks to: Bharata B Rao, Cédric Le Goater, Christian Zigotzky, Christophe Leroy, Guenter
Roeck, Jiri Olsa, Naveen N. Rao, Nicholas Piggin, Valentin Schneider.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmDoT7cTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFbSEACa703VD0R/Zx7/MASoc0nfLkoyDvOS
1dQ7isuNLLwTgFCymS36w5cyUxpQJUepUFLjZajJ5rHgGM60i78trZgrxKp5o+us
r/F530QI2RZdko2rrtoSVUe6Oj4z0l2lYtBHa7UybKIRG7R/po/DmYYe1/Wmq7Bc
fu/R3C7m5/HY63E2mzdCPFHfNDPZavScXyPUpfgEka7PGDJsTCZqIOzeGt9oqm6f
lEysuIvBNvq5NVvUOaBiW4dlbkxckVANvj43kjeX+c0YnJ7MTW/xCYl4AuaAxL2T
Kc23mWgTj2ONrThbhrB5Bq2uf9bMA+4cJKTRfWiGslxLyhXP59jBnJvn6H5oCPf/
pr/iLjA8bBS7HnaeAEjC74IIs8SDnhkWjW9ec3Da2Z4ihl8W15Bkf0+g3GOMdj3J
hrphnhAayr4NKuXgkI/SfoFrAiH+V00LabPA1IFm5Zgvs5DSX/Lygtcf/kNcCZeQ
jI0jgy5HjmVhTyySw+htVicdW8xJ/rvXqJDguLkxiNEZN0PF4UQSPUX+ARBnMUDT
Bn/RSGWrgMR5VI1+kpYephhv/PFmF6dKUFpSFrBqt/sZ7rVj4jJCZpNdLDmkaIYi
v1H4eNN3p85KeApa4PJXRuXrxi5F5XRkGTlk2Sy51ClyFYT6WlzKUcOhMlG4HuzO
oro2jl2LF9steg==
=q3Ag
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fix crashes on 64-bit Book3E due to use of Book3S only mtmsrd
instruction.
Fix "scheduling while atomic" warnings at boot due to preempt count
underflow.
Two commits fixing our handling of BPF atomic instructions.
Fix error handling in xive when allocating an IPI.
Fix lockup on kernel exec fault on 603.
Thanks to Bharata B Rao, Cédric Le Goater, Christian Zigotzky,
Christophe Leroy, Guenter Roeck, Jiri Olsa, Naveen N. Rao, Nicholas
Piggin, and Valentin Schneider"
* tag 'powerpc-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/preempt: Don't touch the idle task's preempt_count during hotplug
powerpc/64e: Fix system call illegal mtmsrd instruction
powerpc/xive: Fix error handling when allocating an IPI
powerpc/bpf: Reject atomic ops in ppc32 JIT
powerpc/bpf: Fix detecting BPF atomic instructions
powerpc/mm: Fix lockup on kernel exec fault
flush_tlb_range is special in that we don't specify the page size used for
the translation. Hence when flushing TLB we flush the translation cache
for all possible page sizes. The kernel also uses the same interface when
moving page tables around. Such a move requires us to flush the page walk
cache.
Instead of adding another interface to force page walk cache flush, update
flush_tlb_range to flush page walk cache if the range flushed is more than
the PMD range. A page table move will always involve an invalidate range
more than PMD_SIZE.
Running microbenchmark with mprotect and parallel memory access didn't
show any observable performance impact.
Link: https://lkml.kernel.org/r/20210616045735.374532-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.
For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().
Commit d7df2443cd ("powerpc/mm: Fix spurious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.
Until commit cbd7e6ca02 ("powerpc/fault: Avoid heavy
search_exception_tables() verification"), an exec fault from kernel
at a userspace address was indirectly caught by the lack of entry for
that address in the exception tables. But after that commit the
kernel mainly relies on KUAP or on core mm handling to catch wrong
user accesses. Here the access is not wrong, so mm handles it.
It is a minor fault because PAGE_EXEC is not set,
set_access_flags_filter() should set PAGE_EXEC and voila.
But as is_exec_fault() returns false as explained in the beginning,
set_access_flags_filter() bails out without setting PAGE_EXEC flag,
which leads to a forever minor exec fault.
As the kernel is not prepared to handle such exec faults, the thing to
do is to fire in bad_kernel_fault() for any exec fault taken by the
kernel, as it was prior to commit d3ca587404.
Fixes: d3ca587404 ("powerpc/mm: Fix reporting of kernel execute faults")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/024bb05105050f704743a0083fe3548702be5706.1625138205.git.christophe.leroy@csgroup.eu
- A big series refactoring parts of our KVM code, and converting some to C.
- Support for ARCH_HAS_SET_MEMORY, and ARCH_HAS_STRICT_MODULE_RWX on some CPUs.
- Support for the Microwatt soft-core.
- Optimisations to our interrupt return path on 64-bit.
- Support for userspace access to the NX GZIP accelerator on PowerVM on Power10.
- Enable KUAP and KUEP by default on 32-bit Book3S CPUs.
- Other smaller features, fixes & cleanups.
Thanks to: Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Baokun Li,
Benjamin Herrenschmidt, Bharata B Rao, Christophe Leroy, Daniel Axtens, Daniel Henrique
Barboza, Finn Thain, Geoff Levand, Haren Myneni, Jason Wang, Jiapeng Chong, Joel Stanley,
Jordan Niethe, Kajol Jain, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas
Piggin, Nick Desaulniers, Paul Mackerras, Russell Currey, Sathvika Vasireddy, Shaokun
Zhang, Stephen Rothwell, Sudeep Holla, Suraj Jitindar Singh, Tom Rix, Vaibhav Jain,
YueHaibing, Zhang Jianhua, Zhen Lei.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmDfFS4THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFxHEAC88NJ+Gz87LiTQFt6QjhziBaJUd+sY
uqADPRROr4P50O8PjYZbMi2qbXzOlLkZO4wJWX7jpZ1F9KmbPNqY2shD8h4ahyge
F/uqzBW1FXBJfnDEKdU2MzalkeTP+dwxLZyouUamjDCGNLFjOV4x/Fft5otOdXjO
k9uO6yoGyOkWYzjC+Y/irNPlIDDByB/+bD92Cb52Y2mXMDDEnx4JzbtkeJW+8udT
Sjn3bWzeL+dz5GehjMKwK4+SptNiyQGOgM8FwtnKUMvgzxv04DqCGjr9YC12L2Z7
VoFZc4GzVgtf8DZg4fJ3KG5aG2nH3Tui7jc9lUckdrxixDAZw5wSG7CQ39gFb/5+
7A4fEJk4Z3h5llibwxAZrC7wV8ZDDXn8oRFzRcOJjfxYaD+ohOOyWHIebwkdiXYx
nfYI7sBcScDLXeBvHtDra2GJpbFSpVL3S/QNhhi1vKVNrFSyAgbAybcVL2xPLZ6+
8Mh7A8xt+hf2bo9AXuYJDo9mwXWfg1093d0kT+AslcRhZioBk18c2AiZLIz0FzuL
Ua/e5FPb99x9LSdcZHvaAXBoHT2iTgDyCyDa3gkIesyuRX6ggHoFcVQuvdDcbJ9d
H8LK+Tahy1Y+E5b6KdtU8mDEGE+QG+CWLnwQ6YSCaL/MYgaFzNa32Jdj1fmztSBC
cttP43kHZ7ljTw==
=zo4d
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A big series refactoring parts of our KVM code, and converting some
to C.
- Support for ARCH_HAS_SET_MEMORY, and ARCH_HAS_STRICT_MODULE_RWX on
some CPUs.
- Support for the Microwatt soft-core.
- Optimisations to our interrupt return path on 64-bit.
- Support for userspace access to the NX GZIP accelerator on PowerVM on
Power10.
- Enable KUAP and KUEP by default on 32-bit Book3S CPUs.
- Other smaller features, fixes & cleanups.
Thanks to: Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Baokun Li, Benjamin Herrenschmidt, Bharata B Rao, Christophe
Leroy, Daniel Axtens, Daniel Henrique Barboza, Finn Thain, Geoff Levand,
Haren Myneni, Jason Wang, Jiapeng Chong, Joel Stanley, Jordan Niethe,
Kajol Jain, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas
Piggin, Nick Desaulniers, Paul Mackerras, Russell Currey, Sathvika
Vasireddy, Shaokun Zhang, Stephen Rothwell, Sudeep Holla, Suraj Jitindar
Singh, Tom Rix, Vaibhav Jain, YueHaibing, Zhang Jianhua, and Zhen Lei.
* tag 'powerpc-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (218 commits)
powerpc: Only build restart_table.c for 64s
powerpc/64s: move ret_from_fork etc above __end_soft_masked
powerpc/64s/interrupt: clean up interrupt return labels
powerpc/64/interrupt: add missing kprobe annotations on interrupt exit symbols
powerpc/64: enable MSR[EE] in irq replay pt_regs
powerpc/64s/interrupt: preserve regs->softe for NMI interrupts
powerpc/64s: add a table of implicit soft-masked addresses
powerpc/64e: remove implicit soft-masking and interrupt exit restart logic
powerpc/64e: fix CONFIG_RELOCATABLE build warnings
powerpc/64s: fix hash page fault interrupt handler
powerpc/4xx: Fix setup_kuep() on SMP
powerpc/32s: Fix setup_{kuap/kuep}() on SMP
powerpc/interrupt: Use names in check_return_regs_valid()
powerpc/interrupt: Also use exit_must_hard_disable() on PPC32
powerpc/sysfs: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE
powerpc/ptrace: Refactor regs_set_return_{msr/ip}
powerpc/ptrace: Move set_return_regs_changed() before regs_set_return_{msr/ip}
powerpc/stacktrace: Fix spurious "stale" traces in raise_backtrace_ipi()
powerpc/pseries/vas: Include irqdomain.h
powerpc: mark local variables around longjmp as volatile
...
The early bad fault or key fault test in do_hash_fault() ends up calling
into ___do_page_fault without having gone through an interrupt handler
wrapper (except the initial _RAW one). This can end up calling local irq
functions while the interrupt has not been reconciled, which will likely
cause crashes and it trips up on a later patch that adds more assertions.
pkey_exec_prot from selftests causes this path to be executed.
There is no real reason to run the in_nmi() test should be performed
before the key fault check. In fact if a perf interrupt in the hash
fault code did a stack walk that was made to take a key fault somehow
then running ___do_page_fault could possibly cause another hash fault
causing problems. Move the in_nmi() test first, and then do everything
else inside the regular interrupt handler function.
Fixes: 3a96570ffc ("powerpc: convert interrupt handlers to use wrappers")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210630074621.2109197-2-npiggin@gmail.com
On SMP, setup_kuep() is also called from start_secondary() since
commit 86f46f3432 ("powerpc/32s: Initialise KUAP and KUEP in C").
start_secondary() is not an __init function.
Remove the __init marker from setup_kuep() and bail out when
not caller on the first CPU as the work is already done.
Fixes: 10248dcba1 ("powerpc/44x: Implement Kernel Userspace Exec Protection (KUEP)")
Fixes: 86f46f3432 ("powerpc/32s: Initialise KUAP and KUEP in C")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8ee05934288994a65743a987acb1558f12c0c8c1.1624969450.git.christophe.leroy@csgroup.eu
On SMP, setup_kup() is also called from start_secondary().
start_secondary() is not an __init function.
Remove the __init marker from setup_kuep() and setup_kuap().
Fixes: 86f46f3432 ("powerpc/32s: Initialise KUAP and KUEP in C")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/42f4bd12b476942e4d5dc81c0e839d8871b20b1c.1624863319.git.christophe.leroy@csgroup.eu
Merge misc updates from Andrew Morton:
"191 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...
Core changes:
- Cleanup and simplification of common code to invoke the low level
interrupt flow handlers when this invocation requires irqdomain
resolution. Add the necessary core infrastructure.
- Provide a proper interface for modular PMU drivers to set the
interrupt affinity.
- Add a request flag which allows to exclude interrupts from spurious
interrupt detection. Useful especially for IPI handlers which always
return IRQ_HANDLED which turns the spurious interrupt detection into a
pointless waste of CPU cycles.
Driver changes:
- Bulk convert interrupt chip drivers to the new irqdomain low level flow
handler invocation mechanism.
- Add device tree bindings for the Renesas R-Car M3-W+ SoC
- Enable modular build of the Qualcomm PDC driver
- The usual small fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbIg8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobZNEAC2wTq3Ishk026va7g5mbQVSvAQyf8G
0msmgJ48lJWVL9a6JUogNcCO7sZCTcAy4CYbuHI6kz1fGZZnNWSCrtEz0rFNAdWE
WVR2k8ExR2R73vJm+K50WUMMj8YsefRnIFXWlJdTp+pksr3TZ7Lo70taGUK/6tMo
aL0dqvnf7Vb3LG0iIkaHWLF4HnyK/UGqB+121rlL4UhI1/g+3EUxNWNcY5eg/dmc
Ym73U1uDsjydp3/3jm8v8NYNtwCDGpujZZc/88RFLjP6PMpF1S9JUvDEt+LHJi0a
cdS3RreB78HYXpLg5NtDFJwIegRMLSitvCGPBjHvWBzbifkMsA2zWIb6Cs8VkYys
vuPoEGZ0ol+SWvcnSh5Xy36nyr4iGIBhQql47UAaqelSxsYPjvCCSD4yJV3k8hnC
ZuDscOekXUMn75qZR0quNdi1SkgKpGZxK73QFbuW3Apl5EgArVai6kq0rbl6zlx6
ACy0SEcevhOcpU6WpqDgrmUBgFr+M8zina8edRELgiFEuWT6pYxKwrN3pT4U5djO
e5V3YuNzzwzvtUoXN4AiTlT8gwRiGfgeiEvHpvZBXPNvk5ffS6XzPiV81ZMWiBkb
ReoCbqME3PKoxj1VAHJvVXHbcjiPIJeCRdV+5vQSNh1SPSQOmEdWyJtNUDrSkoym
QkKKY5jrOhPhlQ==
=FIKh
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Cleanup and simplification of common code to invoke the low level
interrupt flow handlers when this invocation requires irqdomain
resolution. Add the necessary core infrastructure.
- Provide a proper interface for modular PMU drivers to set the
interrupt affinity.
- Add a request flag which allows to exclude interrupts from spurious
interrupt detection. Useful especially for IPI handlers which
always return IRQ_HANDLED which turns the spurious interrupt
detection into a pointless waste of CPU cycles.
Driver changes:
- Bulk convert interrupt chip drivers to the new irqdomain low level
flow handler invocation mechanism.
- Add device tree bindings for the Renesas R-Car M3-W+ SoC
- Enable modular build of the Qualcomm PDC driver
- The usual small fixes and improvements"
* tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
irqchip: gic-pm: Remove redundant error log of clock bulk
irqchip/sun4i: Remove unnecessary oom message
irqchip/irq-imx-gpcv2: Remove unnecessary oom message
irqchip/imgpdc: Remove unnecessary oom message
irqchip/gic-v3-its: Remove unnecessary oom message
irqchip/gic-v2m: Remove unnecessary oom message
irqchip/exynos-combiner: Remove unnecessary oom message
irqchip: Bulk conversion to generic_handle_domain_irq()
genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
genirq: Add generic_handle_domain_irq() helper
irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
irqdesc: Fix __handle_domain_irq() comment
genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
irqdomain: Introduce irq_resolve_mapping()
irqdomain: Protect the linear revmap with RCU
irqdomain: Cache irq_data instead of a virq number in the revmap
irqdomain: Use struct_size() helper when allocating irqdomain
irqdomain: Make normal and nomap irqdomains exclusive
powerpc: Move the use of irq_domain_add_nomap() behind a config option
...
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.
Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
Done with
$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)
with manual tweaks afterwards.
[rppt@linux.ibm.com: fix arm boot crash]
Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of live migration
- Support for initializing the PDPTRs without loading them from memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this has
been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on
AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV9UYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIRgf/XX8fKLh24RnTOs2ldIu2AfRGVrT4
QMrr8MxhmtukBAszk2xKvBt8/6gkUjdaIC3xqEnVjxaDaUvZaEtP7CQlF5JV45rn
iv1zyxUKucXrnIOr+gCioIT7qBlh207zV35ArKioP9Y83cWx9uAs22pfr6g+7RxO
h8bJZlJbSG6IGr3voANCIb9UyjU1V/l8iEHqRwhmr/A5rARPfD7g8lfMEQeGkzX6
+/UydX2fumB3tl8e2iMQj6vLVdSOsCkehvpHK+Z33EpkKhan7GwZ2sZ05WmXV/nY
QLAYfD10KegoNWl5Ay4GTp4hEAIYVrRJCLC+wnLdc0U8udbfCuTC31LK4w==
=NcRh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This covers all architectures (except MIPS) so I don't expect any
other feature pull requests this merge window.
ARM:
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration and
apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical
address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of
live migration
- Support for initializing the PDPTRs without loading them from
memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this
has been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM
registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
on AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
kvm: x86: disable the narrow guest module parameter on unload
selftests: kvm: Allows userspace to handle emulation errors.
kvm: x86: Allow userspace to handle emulation errors
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
KVM: x86: Enhance comments for MMU roles and nested transition trickiness
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
KVM: x86/mmu: Use MMU's role to determine PTTYPE
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
...
Commit aaa2295292 ("powerpc/mm: Add physical address to Linux page
table dump") changed range coalescing to only combine ranges that are
both virtually and physically contiguous, in order to avoid erroneous
combination of unrelated mappings in IOREMAP space.
But in the VMALLOC space, mappings almost never have contiguous
physical pages, so the commit mentionned above leads to dumping one
line per page for vmalloc mappings.
Taking into account the vmalloc always leave a gap between two areas,
we never have two mappings dumped as a single combination even if they
have the exact same flags. The only space that may have encountered
such an issue was the early IOREMAP which is not using vmalloc engine.
But previous commits added gaps between early IO mappings, so it is
not an issue anymore.
That commit created some difficulties with KASAN mappings, see
commit cabe8138b2 ("powerpc: dump as a single line areas mapping a
single physical page.") and with huge page, see
commit b00ff6d8c1 ("powerpc/ptdump: Properly handle non standard
page size").
So, almost revert commit aaa2295292 to properly coalesce pages
mapped with the same flags as before, only keep the display of the
first physical address of the range, as it can be usefull especially
for IO mappings.
It brings back powerpc at the same level as other architectures and
simplifies the conversion to GENERIC PTDUMP.
With the patch:
---[ kasan shadow mem start ]---
0xf8000000-0xf8ffffff 0x07000000 16M huge rw present dirty accessed
0xf9000000-0xf91fffff 0x01434000 2M r present accessed
0xf9200000-0xf95affff 0x02104000 3776K rw present dirty accessed
0xfef5c000-0xfeffffff 0x01434000 656K r present accessed
---[ kasan shadow mem end ]---
Before:
---[ kasan shadow mem start ]---
0xf8000000-0xf8ffffff 0x07000000 16M huge rw present dirty accessed
0xf9000000-0xf91fffff 0x01434000 16K r present accessed
0xf9200000-0xf9203fff 0x02104000 16K rw present dirty accessed
0xf9204000-0xf9207fff 0x0213c000 16K rw present dirty accessed
0xf9208000-0xf920bfff 0x02174000 16K rw present dirty accessed
0xf920c000-0xf920ffff 0x02188000 16K rw present dirty accessed
0xf9210000-0xf9213fff 0x021dc000 16K rw present dirty accessed
0xf9214000-0xf9217fff 0x02220000 16K rw present dirty accessed
0xf9218000-0xf921bfff 0x023c0000 16K rw present dirty accessed
0xf921c000-0xf921ffff 0x023d4000 16K rw present dirty accessed
0xf9220000-0xf9227fff 0x023ec000 32K rw present dirty accessed
...
0xf93b8000-0xf93e3fff 0x02614000 176K rw present dirty accessed
0xf93e4000-0xf94c3fff 0x027c0000 896K rw present dirty accessed
0xf94c4000-0xf94c7fff 0x0236c000 16K rw present dirty accessed
0xf94c8000-0xf94cbfff 0x041f0000 16K rw present dirty accessed
0xf94cc000-0xf94cffff 0x029c0000 16K rw present dirty accessed
0xf94d0000-0xf94d3fff 0x041ec000 16K rw present dirty accessed
0xf94d4000-0xf94d7fff 0x0407c000 16K rw present dirty accessed
0xf94d8000-0xf94f7fff 0x041c0000 128K rw present dirty accessed
...
0xf95ac000-0xf95affff 0x042b0000 16K rw present dirty accessed
0xfef5c000-0xfeffffff 0x01434000 16K r present accessed
---[ kasan shadow mem end ]---
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c56ce1f5c3c75adc9811b1a5f9c410fa74183a8d.1618828806.git.christophe.leroy@csgroup.eu
When using the Radix MMU our PGD is always 64K, and must be naturally
aligned.
For a 4K page size kernel that means page alignment of swapper_pg_dir is
not sufficient, leading to failure to boot.
Use the existing MAX_PTRS_PER_PGD which has the correct value, and
avoids us hard-coding 64K here.
Fixes: e72421a085 ("powerpc: Define swapper_pg_dir[] in C")
Reported-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210624123420.2784187-1-mpe@ellerman.id.au
Enable support for process-scoped invalidations from nested
guests and partition-scoped invalidations for nested guests.
Process-scoped invalidations for any level of nested guests
are handled by implementing H_RPT_INVALIDATE handler in the
nested guest exit path in L0.
Partition-scoped invalidation requests are forwarded to the
right nested guest, handled there and passed down to L0
for eventual handling.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
[aneesh: Nested guest partition-scoped invalidation changes]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Squash in fixup patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621085003.904767-5-bharata@linux.ibm.com
H_RPT_INVALIDATE does two types of TLB invalidations:
1. Process-scoped invalidations for guests when LPCR[GTSE]=0.
This is currently not used in KVM as GTSE is not usually
disabled in KVM.
2. Partition-scoped invalidations that an L1 hypervisor does on
behalf of an L2 guest. This is currently handled
by H_TLB_INVALIDATE hcall and this new replaces the old that.
This commit enables process-scoped invalidations for L1 guests.
Support for process-scoped and partition-scoped invalidations
from/for nested guests will be added separately.
Process scoped tlbie invalidations from L1 and nested guests
need RS register for TLBIE instruction to contain both PID and
LPID. This patch introduces primitives that execute tlbie
instruction with both PID and LPID set in prepartion for
H_RPT_INVALIDATE hcall.
A description of H_RPT_INVALIDATE follows:
int64 /* H_Success: Return code on successful completion */
/* H_Busy - repeat the call with the same */
/* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid
parameters */
hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT
translation
lookaside information */
uint64 id, /* PID/LPID to invalidate */
uint64 target, /* Invalidation target */
uint64 type, /* Type of lookaside information */
uint64 pg_sizes, /* Page sizes */
uint64 start, /* Start of Effective Address (EA)
range (inclusive) */
uint64 end) /* End of EA range (exclusive) */
Invalidation targets (target)
-----------------------------
Core MMU 0x01 /* All virtual processors in the
partition */
Core local MMU 0x02 /* Current virtual processor */
Nest MMU 0x04 /* All nest/accelerator agents
in use by the partition */
A combination of the above can be specified,
except core and core local.
Type of translation to invalidate (type)
---------------------------------------
NESTED 0x0001 /* invalidate nested guest partition-scope */
TLB 0x0002 /* Invalidate TLB */
PWC 0x0004 /* Invalidate Page Walk Cache */
PRT 0x0008 /* Invalidate caching of Process Table
Entries if NESTED is clear */
PAT 0x0008 /* Invalidate caching of Partition Table
Entries if NESTED is set */
A combination of the above can be specified.
Page size mask (pages)
----------------------
4K 0x01
64K 0x02
2M 0x04
1G 0x08
All sizes (-1UL)
A combination of the above can be specified.
All page sizes can be selected with -1.
Semantics: Invalidate radix tree lookaside information
matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters
are different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested
LPID allocated to this partition
* Return H_P5 if (start, end) doesn't form a valid range. Start and
end should be a valid Quadrant address and end > start.
* Return H_NotSupported if the partition is not in running in radix
translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid
addresses. Else start and end should be aligned to 4kB (lower 11
bits clear).
* If NESTED is clear, then invalidate process scoped lookaside
information. Else pid specifies a nested LPID, and the invalidation
is performed on nested guest partition table and nested guest
partition scope real addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3
and quadrant 0 spaces, Else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
Those which are partially covered are considered outside
invalidation range, which allows a caller to optimally invalidate
ranges that may contain mixed page sizes.
* Return H_SUCCESS on success.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621085003.904767-4-bharata@linux.ibm.com
Add a field to mmu_psize_def to store the page size encodings
of H_RPT_INVALIDATE hcall. Initialize this while scanning the radix
AP encodings. This will be used when invalidating with required
page size encoding in the hcall.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210621085003.904767-3-bharata@linux.ibm.com
Use set_memory_attr() instead of the PPC32 specific change_page_attr()
change_page_attr() was checking that the address was not mapped by
blocks and was handling highmem, but that's unneeded because the
affected pages can't be in highmem and block mapping verification
is already done by the callers.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[ruscur: rebase on powerpc/merge with Christophe's new patches]
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210609013431.9805-10-jniethe5@gmail.com
In addition to the set_memory_xx() functions which allows to change
the memory attributes of not (yet) used memory regions, implement a
set_memory_attr() function to:
- set the final memory protection after init on currently used
kernel regions.
- enable/disable kernel memory regions in the scope of DEBUG_PAGEALLOC.
Unlike the set_memory_xx() which can act in three step as the regions
are unused, this function must modify 'on the fly' as the kernel is
executing from them. At the moment only PPC32 will use it and changing
page attributes on the fly is not an issue.
Reported-by: kbuild test robot <lkp@intel.com>
[ruscur: cast "data" to unsigned long instead of int]
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210609013431.9805-9-jniethe5@gmail.com
The set_memory_{ro/rw/nx/x}() functions are required for
STRICT_MODULE_RWX, and are generally useful primitives to have. This
implementation is designed to be generic across powerpc's many MMUs.
It's possible that this could be optimised to be faster for specific
MMUs.
This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered. These cases likely
shouldn't happen, but a more complex implementation with MMU-specific code
could safely handle them.
On hash, the linear mapping is not kept in the linux pagetable, so this
will not change the protection if used on that range. Currently these
functions are not used on the linear map so just WARN for now.
apply_to_existing_page_range() does not work on huge pages so for now
disallow changing the protection of huge pages.
[jpn: - Allow set memory functions to be used without Strict RWX
- Hash: Disallow certain regions
- Have change_page_attr() take function pointers to manipulate ptes
- Radix: Add ptesync after set_pte_at()]
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210609013431.9805-2-jniethe5@gmail.com
Merge some powerpc KVM patches from our topic branch.
In particular this brings in Nick's big series rewriting parts of the
guest entry/exit path in C.
Conflicts:
arch/powerpc/kernel/security.c
arch/powerpc/kvm/book3s_hv_rmhandlers.S
Update _tlbiel_pid() such that we can avoid build errors like below when
using this function in other places.
arch/powerpc/mm/book3s64/radix_tlb.c: In function ‘__radix__flush_tlb_range_psize’:
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: warning: ‘asm’ operand 3 probably does not match constraints
114 | asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
| ^~~
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: error: impossible constraint in ‘asm’
make[4]: *** [scripts/Makefile.build:271: arch/powerpc/mm/book3s64/radix_tlb.o] Error 1
m
With this fix, we can also drop the __always_inline in __radix_flush_tlb_range_psize
which was added by commit e12d6d7d46 ("powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210610083639.387365-1-aneesh.kumar@linux.ibm.com
PTE_SIZE means PTE page table size in most placed, whereas
in hash_low.S in means size of one entry in the table.
Rename it PTE_T_SIZE, and define it directly in hash_low.S
instead of going through asm-offsets.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/83a008a9fd6cc3f2bbcb470f592555d260ed7a3d.1623063174.git.christophe.leroy@csgroup.eu
At the time being, empty_zero_page[] is defined in each
platform head.S.
Define it in mm/mem.c instead, and put it in BSS section instead
of the DATA section. Commit 5227cfa71f ("arm64: mm: place
empty_zero_page in bss") explains why it is interesting to have
it in BSS.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5838caffa269e0957c5a50cc85477876220298b0.1623063174.git.christophe.leroy@csgroup.eu
DEBUG_CLAMP_LAST_CONTEXT was there in the old days to reduce
number of contexts in order to ease debugging implementation
of context switching, but that's been quite stable during
years now.
As it is not user selectable, remove it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/da81837b452e8b9f1657b529b9c3050dc10b9770.1622712515.git.christophe.leroy@csgroup.eu
PPC64 uses MMU features to enable/disable KUAP at boot time.
But feature fixups are applied way too early on PPC32.
Now that all KUAP related actions are in C following the
conversion of KUAP initial setup and context switch in C,
static branches can be used to enable/disable KUAP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Export disable_kuap_key to fix build errors]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/cd79e8008455fba5395d099f9bb1305c039b931c.1622708530.git.christophe.leroy@csgroup.eu
PPC64 uses MMU features to enable/disable KUEP at boot time.
But feature fixups are applied way too early on PPC32.
Now that all KUEP related actions are in C following the
conversion of KUEP initial setup and context switch in C,
static branches can be used to enable/disable KUEP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7745a2c3a08ec46302920a3f48d1cb9b5469dbbb.1622708530.git.christophe.leroy@csgroup.eu
switch_mmu_context() does things that can easily be done in C.
For updating user segments, we have update_user_segments().
As mentionned in commit b5efec00b6 ("powerpc/32s: Move KUEP
locking/unlocking in C"), update_user_segments() has the loop
unrolled which is a significant performance gain.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/05c0875ad8220c03452c3a334946e207c6ca04d6.1622708530.git.christophe.leroy@csgroup.eu
KUEP implements the update of user segment registers.
Move it into mmu-hash.h in order to use it from other places.
And inline kuep_lock() and kuep_unlock(). Inlining kuep_lock() is
important for system_call_exception(), otherwise system_call_exception()
has to save into stack the system call parameters that are used just
after, and doing that takes more instructions than kuep_lock() itself.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/24591ca480d14a62ef910e38a5273d551262c4a2.1622708530.git.christophe.leroy@csgroup.eu
'struct ppc_inst' is an internal representation of an instruction, but
in-memory instructions are and will remain a table of 'u32' forever.
Replace all 'struct ppc_inst *' used for locating an instruction in
memory by 'u32 *'. This removes a lot of undue casts to 'struct
ppc_inst *'.
It also helps locating ab-use of 'struct ppc_inst' dereference.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Fix ppc_inst_next(), use u32 instead of unsigned int]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7062722b087228e42cbd896e39bfdf526d6a340a.1621516826.git.christophe.leroy@csgroup.eu
Rather than partition the guest PID space + flush a rogue guest PID to
work around this problem, instead fix it by always disabling the MMU when
switching in or out of guest MMU context in HV mode.
This may be a bit less efficient, but it is a lot less complicated and
allows the P9 path to trivally implement the workaround too. Newer CPUs
are not subject to this issue.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210528090752.3542186-22-npiggin@gmail.com
A bunch of PPC files are missing the inclusion of linux/of.h and
linux/irqdomain.h, relying on transitive inclusion from another
file.
As we are about to break this dependency, make sure these dependencies
are explicit.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Commit b26e8f2725 ("powerpc/mem: Move cache flushing functions into
mm/cacheflush.c") removed asm/sparsemem.h which is required when
CONFIG_MEMORY_HOTPLUG is selected to get the declaration of
create_section_mapping().
Add it back.
Fixes: b26e8f2725 ("powerpc/mem: Move cache flushing functions into mm/cacheflush.c")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3e5b63bb3daab54a1eb9c20221c2e9528c4db9b3.1622883330.git.christophe.leroy@csgroup.eu
Merge more updates from Andrew Morton:
"The remainder of the main mm/ queue.
143 patches.
Subsystems affected by this patch series (all mm): pagecache, hugetlb,
userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
kfence"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits)
kfence: use power-efficient work queue to run delayed work
kfence: maximize allocation wait timeout duration
kfence: await for allocation using wait_event
kfence: zero guard page after out-of-bounds access
mm/process_vm_access.c: remove duplicate include
mm/mempool: minor coding style tweaks
mm/highmem.c: fix coding style issue
btrfs: use memzero_page() instead of open coded kmap pattern
iov_iter: lift memzero_page() to highmem.h
mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
mm/zswap.c: switch from strlcpy to strscpy
arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
mm,memory_hotplug: allocate memmap from the added memory range
mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
mm,memory_hotplug: relax fully spanned sections check
drivers/base/memory: introduce memory_block_{online,offline}
mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
...
Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
This series tries to disable huge pmd unshare of hugetlbfs backed memory
for uffd-wp. Although uffd-wp of hugetlbfs is still during rfc stage,
the idea of this series may be needed for multiple tasks (Axel's uffd
minor fault series, and Mike's soft dirty series), so I picked it out
from the larger series.
This patch (of 4):
It is a preparation work to be able to behave differently in the per
architecture huge_pte_alloc() according to different VMA attributes.
Pass it deeper into huge_pmd_share() so that we can avoid the find_vma() call.
[peterx@redhat.com: build fix]
Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
"A few misc subsystems and some of MM.
175 patches.
Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
mm/memory-failure: unnecessary amount of unmapping
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
net: page_pool: use alloc_pages_bulk in refill code path
net: page_pool: refactor dma_map into own function page_pool_dma_map
SUNRPC: refresh rq_pages using a bulk page allocator
SUNRPC: set rq_page_end differently
mm/page_alloc: inline __rmqueue_pcplist
mm/page_alloc: optimize code layout for __alloc_pages_bulk
mm/page_alloc: add an array-based interface to the bulk page allocator
mm/page_alloc: add a bulk page allocator
mm/page_alloc: rename alloced to allocated
mm/page_alloc: duplicate include linux/vmalloc.h
mm, page_alloc: avoid page_to_pfn() in move_freepages()
mm/Kconfig: remove default DISCONTIGMEM_MANUAL
mm: page_alloc: dump migrate-failed pages
mm/mempolicy: fix mpol_misplaced kernel-doc
mm/mempolicy: rewrite alloc_pages_vma documentation
mm/mempolicy: rewrite alloc_pages documentation
mm/mempolicy: rename alloc_pages_current to alloc_pages
...
mem_init_print_info() is called in mem_init() on each architecture, and
pass NULL argument, so using void argument and move it into mm_init().
Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc]
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64]
Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a shim around vunmap_range, get rid of it.
Move the main API comment from the _noflush variant to the normal
variant, and make _noflush internal to mm/.
[npiggin@gmail.com: fix nommu builds and a comment bug per sfr]
Link: https://lkml.kernel.org/r/1617292598.m6g0knx24s.astroid@bobo.none
[akpm@linux-foundation.org: move vunmap_range_noflush() stub inside !CONFIG_MMU, not !CONFIG_NUMA]
[npiggin@gmail.com: fix nommu builds]
Link: https://lkml.kernel.org/r/1617292497.o1uhq5ipxp.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210322021806.892164-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Cédric Le Goater <clg@kaod.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.
Link: https://lkml.kernel.org/r/20210317062402.533919-8-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.
This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.
This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory ordering comment no longer applies, because mm_ctx_id is
no longer used anywhere. At best always been difficult to follow.
It's better to consider the load on which the slbmte depends on, which
the MMU depends on before it can start loading TLBs, rather than a
store which may or may not have a subsequent dependency chain to the
slbmte.
So update the comment and we use the load of the mm's user context ID.
This is much more analogous the radix ordering too, which is good.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210421151733.212858-1-npiggin@gmail.com
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.
Since then, probe_kernel_read_inst() has been renamed
copy_inst_from_kernel_nofault(). And mm/maccess.h didn't exist at that
time. Today, mm/maccess.h is related to copy_from_kernel_nofault().
Move copy_inst_from_kernel_nofault() into mm/maccess.c
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9655d8957313906b77b8db5700a0e33ce06f45e5.1618405715.git.christophe.leroy@csgroup.eu
Define macros to list ppc interrupt types in interttupt.h, replace the
reference of the trap hex values with these macros.
Referred the hex numbers in arch/powerpc/kernel/exceptions-64e.S,
arch/powerpc/kernel/exceptions-64s.S, arch/powerpc/kernel/head_*.S,
arch/powerpc/kernel/head_booke.h and arch/powerpc/include/asm/kvm_asm.h.
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
[mpe: Resolve conflicts in nmi_disables_ftrace(), fix 40x build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1618398033-13025-1-git-send-email-sxwjean@me.com
The lkp bot pointed out that with W=1 we get:
arch/powerpc/mm/book3s64/radix_pgtable.c:183:6: error: no previous
prototype for 'radix__change_memory_range'
Which is really saying that it could be static, make it so.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
search_exception_tables + __bad_page_fault can be substituted with
bad_page_fault, do_page_fault no longer needs to return a value
to asm for any sub-architecture, and __bad_page_fault can be static.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210316104206.407354-10-npiggin@gmail.com
With non-volatile registers saved on interrupt, bad_page_fault
can now be called by do_page_fault.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210316104206.407354-9-npiggin@gmail.com
flush_dcache_page() is only a few lines, it is worth
inlining.
ia64, csky, mips, openrisc and riscv have a similar
flush_dcache_page() and inline it.
On pmac32_defconfig, we get a small size reduction.
On ppc64_defconfig, we get a very small size increase.
In both case that's in the noise (less than 0.1%).
text data bss dec hex filename
18991155 5934744 1497624 26423523 19330e3 vmlinux64.before
18994829 59367321497624 26429185 1934701 vmlinux64.after
9150963 2467502 184548 11803013 b41985 vmlinux32.before
9149689 2467302 184548 11801539 b413c3 vmlinux32.after
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/21c417488b70b7629dae316539fb7bb8bdef4fdd.1617895813.git.christophe.leroy@csgroup.eu
__flush_dcache_icache() is usable for non HIGHMEM pages on
every platform.
It is only for HIGHMEM pages that BOOKE needs kmap() and
BOOK3S needs flush_dcache_icache_phys().
So make flush_dcache_icache_phys() dependent on CONFIG_HIGHMEM and
call it only when it is a HIGHMEM page.
We could make flush_dcache_icache_phys() available at all time,
but as it is declared NOKPROBE_SYMBOL(), GCC doesn't optimise
it out when it is not used.
So define a stub for !CONFIG_HIGHMEM in order to remove the #ifdef in
flush_dcache_icache_page() and use IS_ENABLED() instead.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/79ed5d7914f497cd5fcd681ca2f4d50a91719455.1617895813.git.christophe.leroy@csgroup.eu
flush_dcache_icache_hugepage() is a static function, with
only one caller. That caller calls it when PageCompound() is true,
so bugging on !PageCompound() is useless if we can trust the
compiler a little. Remove the BUG_ON(!PageCompound()).
The number of elements of a page won't change over time, but
GCC doesn't know about it, so it gets the value at every iteration.
To avoid that, call compound_nr() outside the loop and save it in
a local variable.
Whether the page is a HIGHMEM page or not doesn't change over time.
But GCC doesn't know it so it does the test on every iteration.
Do the test outside the loop.
When the page is not a HIGHMEM page, page_address() will fallback on
lowmem_page_address(), so call lowmem_page_address() directly and
don't suffer the call to page_address() on every iteration.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ab03712b70105fccfceef095aa03007de9295a40.1617895813.git.christophe.leroy@csgroup.eu
flush_coherent_icache() doesn't need the address anymore,
so it can be called immediately when entering the public
functions and doesn't need to be disseminated among
lower level functions.
And use page_to_phys() instead of open coding the calculation
of phys address to call flush_dcache_icache_phys().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5f063986e325d2efdd404b8f8c5f4bcbd4eb11a6.1617895813.git.christophe.leroy@csgroup.eu
On book3s/32, the segment below kernel text is used for module
allocation when CONFIG_STRICT_KERNEL_RWX is defined.
In order to benefit from the powerpc specific module_alloc()
function which allocate modules with 32 Mbytes from
end of kernel text, use that segment below PAGE_OFFSET at all time.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a46dcdd39a9e80b012d86c294c4e5cd8d31665f3.1617283827.git.christophe.leroy@csgroup.eu
While removing large number of mappings from hash page tables for
large memory systems as soft-lockup is reported because of the time
spent inside htap_remove_mapping() like one below:
watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
<snip>
NIP plpar_hcall+0x38/0x58
LR pSeries_lpar_hpte_invalidate+0x68/0xb0
Call Trace:
0x1fffffffffff000 (unreliable)
pSeries_lpar_hpte_removebolted+0x9c/0x230
hash__remove_section_mapping+0xec/0x1c0
remove_section_mapping+0x28/0x3c
arch_remove_memory+0xfc/0x150
devm_memremap_pages_release+0x180/0x2f0
devm_action_release+0x30/0x50
release_nodes+0x28c/0x300
device_release_driver_internal+0x16c/0x280
unbind_store+0x124/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
__vfs_write+0x3c/0x70
vfs_write+0xd4/0x270
ksys_write+0xdc/0x130
system_call+0x5c/0x70
Fix this by adding a cond_resched() to the loop in
htap_remove_mapping() that issues hcall to remove hpte mapping. The
call to cond_resched() is issued every HZ jiffies which should prevent
the soft-lockup from being reported.
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210404163148.321346-1-vaibhav@linux.ibm.com
When we enabled STRICT_KERNEL_RWX we received some reports of boot
failures when using the Hash MMU and running under phyp. The crashes
are intermittent, and often exhibit as a completely unresponsive
system, or possibly an oops.
One example, which was caught in xmon:
[ 14.068327][ T1] devtmpfs: mounted
[ 14.069302][ T1] Freeing unused kernel memory: 5568K
[ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
[ 14.142063][ T1] Run /sbin/init as init process
[ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
lr: c0000000001862d4: update_rq_clock+0x44/0x110
sp: c00000000c747880
msr: 8000000040001031
current = 0xc00000000c60d380
paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
pid = 347, comm = kworker/2:1
...
enter ? for help
[c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
[c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
[c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
[c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
[c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
[c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
[c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
[c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
[c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
[c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
[c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
This shows that CPU 2, which was idle, woke up and then appears to
randomly take an instruction fault on a completely valid area of
kernel text.
The cause turns out to be the call to hash__mark_rodata_ro(), late in
boot. Due to the way we layout text and rodata, that function actually
changes the permissions for all of text and rodata to read-only plus
execute.
To do the permission change we use a hypervisor call, H_PROTECT. On
phyp that appears to be implemented by briefly removing the mapping of
the kernel text, before putting it back with the updated permissions.
If any other CPU is executing during that window, it will see spurious
faults on the kernel text and/or data, leading to crashes.
To fix it we use stop machine to collect all other CPUs, and then have
them drop into real mode (MMU off), while we change the mapping. That
way they are unaffected by the mapping temporarily disappearing.
We don't see this bug on KVM because KVM always use VPM=1, where
faults are directed to the hypervisor, and the fault will be
serialised vs the h_protect() by HPTE_V_HVLOCK.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-5-mpe@ellerman.id.au
Pull the loop calling hpte_updateboltedpp() out of
hash__change_memory_range() into a helper function. We need it to be a
separate function for the next patch.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210331003845.216246-4-mpe@ellerman.id.au
In hash__mark_rodata_ro() we pass the raw PP_RXXX value to
hash__change_memory_range(). That has the effect of setting the key to
zero, because PP_RXXX contains no key value.
Fix it by using htab_convert_pte_flags(), which knows how to convert a
pgprot into a pp value, including the key.
Fixes: d94b827e89 ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Link: https://lore.kernel.org/r/20210331003845.216246-3-mpe@ellerman.id.au
When adding a PTE a ptesync is needed to order the update of the PTE
with subsequent accesses otherwise a spurious fault may be raised.
radix__set_pte_at() does not do this for performance gains. For
non-kernel memory this is not an issue as any faults of this kind are
corrected by the page fault handler. For kernel memory these faults
are not handled. The current solution is that there is a ptesync in
flush_cache_vmap() which should be called when mapping from the
vmalloc region.
However, map_kernel_page() does not call flush_cache_vmap(). This is
troublesome in particular for code patching with Strict RWX on radix.
In do_patch_instruction() the page frame that contains the instruction
to be patched is mapped and then immediately patched. With no ordering
or synchronization between setting up the PTE and writing to the page
it is possible for faults.
As the code patching is done using __put_user_asm_goto() the resulting
fault is obscured - but using a normal store instead it can be seen:
BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
Faulting instruction address: 0xc00000000008bd74
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: nop_module(PO+) [last unloaded: nop_module]
CPU: 4 PID: 757 Comm: sh Tainted: P O 5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
NIP: c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
REGS: c000000016f634a0 TRAP: 0300 Tainted: P O (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44002884 XER: 00000000
CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
This results in the kind of issue reported here:
https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
Chris Riedl suggested a reliable way to reproduce the issue:
$ mount -t debugfs none /sys/kernel/debug
$ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done) &
Turning ftrace on and off does a large amount of code patching which
in usually less then 5min will crash giving a trace like:
ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
------------[ ftrace bug ]------------
ftrace failed to modify
[<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
actual: 11:3b:47:4b
Setting ftrace call site to call ftrace function
ftrace record flags: 80000001
(1)
expected tramp: c00000000006c96c
------------[ cut here ]------------
WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
Modules linked in: nop_module(PO-) [last unloaded: nop_module]
CPU: 4 PID: 809 Comm: sh Tainted: P O 5.10.0-rc5-01360-gf878ccaf250a #1
NIP: c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
REGS: c000000004c8b760 TRAP: 0700 Tainted: P O (5.10.0-rc5-01360-gf878ccaf250a)
MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28008848 XER: 20040000
CFAR: c0000000001a9c98 IRQMASK: 0
GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
NIP ftrace_bug+0x28c/0x2e8
LR ftrace_bug+0x288/0x2e8
Call Trace:
ftrace_bug+0x288/0x2e8 (unreliable)
ftrace_modify_all_code+0x168/0x210
arch_ftrace_update_code+0x18/0x30
ftrace_run_update_code+0x44/0xc0
ftrace_startup+0xf8/0x1c0
register_ftrace_function+0x4c/0xc0
function_trace_init+0x80/0xb0
tracing_set_tracer+0x2a4/0x4f0
tracing_set_trace_write+0xd4/0x130
vfs_write+0xf0/0x330
ksys_write+0x84/0x140
system_call_exception+0x14c/0x230
system_call_common+0xf0/0x27c
To fix this when updating kernel memory PTEs using ptesync.
Fixes: f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Tidy up change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210208032957.1232102-1-jniethe5@gmail.com
Various spelling/typo fixes.
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hash faults use the trap vector to decide whether this is an
instruction or data fault. This should use the TRAP accessor
rather than open access regs->trap.
This won't cause a problem at the moment because 64s only uses
trap flags for system call interrupts (the norestart flag), but
that could change if any other trap flags get used in future.
Fixes: a4922f5442 ("powerpc/64s: move the hash fault handling logic to C")
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210316105205.407767-1-npiggin@gmail.com
In fault.c, #ifdef CONFIG_PPC_MEM_KEYS is not needed because all
functions are always defined, and arch_vma_access_permitted()
always returns true when CONFIG_PPC_MEM_KEYS is not defined so
access_pkey_error() will return false so bad_access_pkey()
will never be called.
Include linux/pkeys.h to get a definition of vma_pkeys() for
bad_access_pkey().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8038392f38d81f2ad169347efac29146f553b238.1615819955.git.christophe.leroy@csgroup.eu
lkp reported warnings in some configuration due to
update_current_thread_amr() being unused:
arch/powerpc/mm/book3s64/pkeys.c:284:20: error: unused function 'update_current_thread_amr'
static inline void update_current_thread_amr(u64 value)
Which is because it's only use is inside an ifdef. We could move it
inside the ifdef, but it's a single line function and only has one
caller, so just fold it in.
Similarly update_current_thread_iamr() is small and only called once,
so fold it in also.
Fixes: 48a8ab4eeb ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode.")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210314093320.132331-1-mpe@ellerman.id.au
If the code can use a stack in vm area, it can also use a
stack in linear space.
Simplify code by removing old non VMAP stack code on PPC32.
That means the data translation is now re-enabled early in
exception prolog in all cases, not only when using VMAP stacks.
While we are touching EXCEPTION_PROLOG macros, remove the
unused for_rtas parameter in EXCEPTION_PROLOG_1.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7cd6440c60a7e8f4f035b245c57720f51e225aae.1615552866.git.christophe.leroy@csgroup.eu
Add architecture specific implementation details for KFENCE and enable
KFENCE for the ppc32 architecture. In particular, this implements the
required interface in <asm/kfence.h>.
KFENCE requires that attributes for pages from its memory pool can
individually be set. Therefore, force the Read/Write linear map to be
mapped at page granularity.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8dfe1bd2abde26337c1d8c1ad0acfcc82185e0d5.1614868445.git.christophe.leroy@csgroup.eu
The mutex linear_mapping_mutex is defined at the of the file while its
only two user are within the CONFIG_MEMORY_HOTPLUG block.
A compile without CONFIG_MEMORY_HOTPLUG set fails on PREEMPT_RT because
its mutex implementation is smart enough to realize that it is unused.
Move the definition of linear_mapping_mutex to ifdef block where it is
used.
Fixes: 1f73ad3e8d ("powerpc/mm: print warning in arch_remove_linear_mapping()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210219165648.2505482-1-bigeasy@linutronix.de
A large series adding wrappers for our interrupt handlers, so that irq/nmi/user
tracking can be isolated in the wrappers rather than spread in each handler.
Conversion of the 32-bit syscall handling into C.
A series from Nick to streamline our TLB flushing when using the Radix MMU.
Switch to using queued spinlocks by default for 64-bit server CPUs.
A rework of our PCI probing so that it happens later in boot, when more generic
infrastructure is available.
Two small fixes to allow 32-bit little-endian processes to run on 64-bit
kernels.
Other smaller features, fixes & cleanups.
Thanks to:
Alexey Kardashevskiy, Ananth N Mavinakayanahalli, Aneesh Kumar K.V, Athira
Rajeev, Bhaskar Chowdhury, Cédric Le Goater, Chengyang Fan, Christophe Leroy,
Christopher M. Riedl, Fabiano Rosas, Florian Fainelli, Frederic Barrat, Ganesh
Goudar, Hari Bathini, Jiapeng Chong, Joseph J Allen, Kajol Jain, Markus
Elfring, Michal Suchanek, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pingfan Liu, Po-Hsu Lin, Qian Cai, Ram Pai, Randy Dunlap, Sandipan
Das, Stephen Rothwell, Tyrel Datwyler, Will Springer, Yury Norov, Zheng
Yongjun.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmAzMagTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgAbBD/wMS2g1Q9oAGZPsx2NGd2RoeAauGxUs
Yj6cZVmR+oa6sJyFYgEG7dT7tcwJITQxLBD3HpsHSnJ/rLrMloE33+cZNA9c4STz
0mlzm3R7M5pOgcEqZglsgLP0RQeUuHSSF01g0kf1N3r+HYtmbmPjuUIl8CnAjlbT
iMD2ZN2p8/r3kDDht0iBO534HUpsqhc00duSZgQhsV/PR7ZWVxoPk7PEJeo4vXlJ
77986F7J5NLUTjMiLv5lTx49FcPbRd7a1jubsBtahJrwXj2GVvuy2i86G7HY+a+B
eSxN7zJQgaFeLo0YPo7fZLBI0MAsIQt3nnZhKX0TMglbv/K8Aq64xiJqsVQdJ883
CeEt0HvSJhsSC0C4O595NEINfDhDd+5IeSF9MvsujYXiUKRXtRkm1EPuAzTcZIzW
NwkCLRo33NMXa+khMKaiqF/g7INayPUXoWESx75NXFsuNfcORvstkeUuEoi5GwJo
TSlmosFqwRjghQ8eTLZuWBzmh3EpPGdtC4gm6D+lbzhzjah5c/1whyuLqra275kK
E3Qt0/V0ixKyvlG7MI5yYh3L7+R/hrsflH7xIJJxZp2DW6mwBJzQYmkxDbSS8PzK
nWien2XgpIQhSFat3QqreEFSfNkzdN2MClVi2Y1hpAgi+2Zm9rPdPNGcQI+DSOsB
kpJkjOjWNJU/PQ==
=dB2S
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A large series adding wrappers for our interrupt handlers, so that
irq/nmi/user tracking can be isolated in the wrappers rather than
spread in each handler.
- Conversion of the 32-bit syscall handling into C.
- A series from Nick to streamline our TLB flushing when using the
Radix MMU.
- Switch to using queued spinlocks by default for 64-bit server CPUs.
- A rework of our PCI probing so that it happens later in boot, when
more generic infrastructure is available.
- Two small fixes to allow 32-bit little-endian processes to run on
64-bit kernels.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Ananth N Mavinakayanahalli, Aneesh
Kumar K.V, Athira Rajeev, Bhaskar Chowdhury, Cédric Le Goater, Chengyang
Fan, Christophe Leroy, Christopher M. Riedl, Fabiano Rosas, Florian
Fainelli, Frederic Barrat, Ganesh Goudar, Hari Bathini, Jiapeng Chong,
Joseph J Allen, Kajol Jain, Markus Elfring, Michal Suchanek, Nathan
Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Pingfan Liu,
Po-Hsu Lin, Qian Cai, Ram Pai, Randy Dunlap, Sandipan Das, Stephen
Rothwell, Tyrel Datwyler, Will Springer, Yury Norov, and Zheng Yongjun.
* tag 'powerpc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (188 commits)
powerpc/perf: Adds support for programming of Thresholding in P10
powerpc/pci: Remove unimplemented prototypes
powerpc/uaccess: Merge raw_copy_to_user_allowed() into raw_copy_to_user()
powerpc/uaccess: Merge __put_user_size_allowed() into __put_user_size()
powerpc/uaccess: get rid of small constant size cases in raw_copy_{to,from}_user()
powerpc/64: Fix stack trace not displaying final frame
powerpc/time: Remove get_tbl()
powerpc/time: Avoid using get_tbl()
spi: mpc52xx: Avoid using get_tbl()
powerpc/syscall: Avoid storing 'current' in another pointer
powerpc/32: Handle bookE debugging in C in syscall entry/exit
powerpc/syscall: Do not check unsupported scv vector on PPC32
powerpc/32: Remove the counter in global_dbcr0
powerpc/32: Remove verification of MSR_PR on syscall in the ASM entry
powerpc/syscall: implement system call entry/exit logic in C for PPC32
powerpc/32: Always save non volatile GPRs at syscall entry
powerpc/syscall: Change condition to check MSR_RI
powerpc/syscall: Save r3 in regs->orig_r3
powerpc/syscall: Use is_compat_task()
powerpc/syscall: Make interrupt.c buildable on PPC32
...
We added dcache flush on memory add/remove in commit
fb5924fddf ("powerpc/mm: Flush cache on memory hot(un)plug") to
handle crashes on GPU hotplug. Instead of adding dcache flush in
generic memory add/remove routine which is used even for regular
memory, we should handle these devices specific flush in the device
driver code.
memtrace did handle this in the driver and that was removed by commit
7fd6641de2 ("powerpc/powernv/memtrace: Let the arch hotunplug code
flush cache"). This patch reverts that commit.
The dcache flush in memory add was removed by commit
ea458effa8 ("powerpc: Don't flush caches when adding memory") which
I don't think is correct. The reason why we require dcache flush in
memtrace is to make sure we don't have a dirty cache when we remap a
pfn to cache inhibited. We should do that when the memtrace module
removes the memory and make the pfn available for HTM traces to map it
as cache inhibited.
The other device mentioned in commit fb5924fddf ("powerpc/mm: Flush
cache on memory hot(un)plug") is nvlink device with coherent memory.
The support for that was removed in commit
7eb3cf7619 ("powerpc/powernv: remove unused NPU DMA code") and
commit 25b2995a35 ("mm: remove MEMORY_DEVICE_PUBLIC support")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210203045812.234439-3-aneesh.kumar@linux.ibm.com
THP config results in compound pages. Make sure the kernel enables
the PageCompound() check with CONFIG_HUGETLB_PAGE disabled and
CONFIG_TRANSPARENT_HUGEPAGE enabled.
This makes sure we correctly flush the icache with THP pages.
flush_dcache_icache_page only matter for platforms that don't support
COHERENT_ICACHE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210203045812.234439-1-aneesh.kumar@linux.ibm.com
As reported by lkp:
arch/powerpc/mm/book3s64/radix_tlb.c:646:6: warning: no previous
prototype for function 'exit_lazy_flush_tlb'
Fix it by moving the prototype into the existing header.
Fixes: 032b7f0893 ("powerpc/64s/radix: serialize_against_pte_lookup IPIs trim mm_cpumask")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210210130804.3190952-2-mpe@ellerman.id.au
The allyesconfig ppc64 kernel fails to link with relocations unable to
fit after commit 3a96570ffc ("powerpc: convert interrupt handlers to
use wrappers"), which is due to the interrupt handler functions being
put into the .noinstr.text section, which the linker script places on
the opposite side of the main .text section from the interrupt entry
asm code which calls the handlers.
This results in a lot of linker stubs that overwhelm the 252-byte sized
space we allow for them, or in the case of BE a .opd relocation link
error for some reason.
It's not required to put interrupt handlers in the .noinstr section,
previously they used NOKPROBE_SYMBOL, so take them out and replace
with a NOKPROBE_SYMBOL in the wrapper macro. Remove the explicit
NOKPROBE_SYMBOL macros in the interrupt handler functions. This makes
a number of interrupt handlers nokprobe that were not prior to the
interrupt wrappers commit, but since that commit they were made
nokprobe due to being in .noinstr.text, so this fix does not change
that.
The fixes tag is different to the commit that first exposes the problem
because it is where the wrapper macros were introduced.
Fixes: 8d41fc618a ("powerpc: interrupt handler wrapper functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Slightly fix up comment wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210211063636.236420-1-npiggin@gmail.com
Function names should tell what the function does, not how.
mfsrin() and mtsrin() are read/writing segment registers.
They are called that way because they are using mfsrin and mtsrin
instructions, but it doesn't matter for the caller.
In preparation of following patch, change their name to mfsr() and mtsr()
in order to make it obvious they manipulate segment registers without
messing up with how they do it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f92d99f4349391b77766745900231aa880a0efb5.1612612022.git.christophe.leroy@csgroup.eu
serialize_against_pte_lookup() performs IPIs to all CPUs in mm_cpumask.
Take this opportunity to try trim the CPU out of mm_cpumask. This can
reduce the cost of future serialize_against_pte_lookup() and/or the
cost of future TLB flushes.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-7-npiggin@gmail.com
A single-threaded process that is flushing its own address space is
so far the only case where the mm_cpumask is attempted to be trimmed.
This patch expands that to flush in other situations, multi-threaded
processes and external sources. For now it's a relatively simple
occasional trim attempt. The main aim is to add the mechanism,
tweaking and tuning can come with more data.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-6-npiggin@gmail.com
mm_cpumask trimming is currently restricted to be issued by the current
thread of a single-threaded mm. This patch relaxes that and allows the
mask to be trimmed from any context.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-5-npiggin@gmail.com
If there are no CPUs in mm_cpumask, no TLB flush is required at all.
This patch adds a check for this case.
Currently it's not tested for, in fact mm_is_thread_local() returns
false if the current CPU is not in mm_cpumask, so it's treated as a
global flush.
This can come up in some cases like exec failure before the new mm has
ever been switched to. This patch reduces TLBIE instructions required
to build a kernel from about 120,000 to 45,000. Another situation it
could help is page reclaim, KSM, THP, etc., (i.e., asynch operations
external to the process) where the process is sleeping and has all TLBs
flushed out of all CPUs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-4-npiggin@gmail.com
The logic to decide what kind of TLB flush is required (local, global,
or IPI) is spread multiple times over the several kinds of TLB flushes.
Move it all into a single function which may issue IPIs if necessary,
and also returns a flush type that is to be used.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-3-npiggin@gmail.com
Add a comment explaining part of the logic for mm_cpumask trimming, and
add a (hopefully graceful) check and warning in case something gets it
wrong.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201217134731.488135-2-npiggin@gmail.com
This moves exception_enter/exit calls to wrapper functions for
synchronous interrupts. More interrupt handlers are covered by
this than previously.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-33-npiggin@gmail.com
This moves the 64s/hash context tracking from hash_page_mm() to
__do_hash_fault(), so it's no longer called by OCXL / SPU
accelerators, which was certainly the wrong thing to be doing,
because those callers are not low level interrupt handlers, so
should have entered a kernel context tracking already.
Then remain in kernel context for the duration of the fault,
rather than enter/exit for the hash fault then enter/exit for
the page fault, which is pointless.
Even still, calling exception_enter/exit in __do_hash_fault seems
questionable because that's touching per-cpu variables, tracing,
etc., which might have been interrupted by this hash fault or
themselves cause hash faults. But maybe I miss something because
hash_page_mm very deliberately calls trace_hash_fault too, for
example. So for now go with it, it's no worse than before, in this
regard.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-32-npiggin@gmail.com
Simple helper for synchronous interrupt handlers (i.e., process-context)
to enable interrupts if it was taken in an interrupts-enabled context.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-30-npiggin@gmail.com
This makes a small improvement to the description of the SLB interrupt
environment. Move the memory access restrictions into one paragraph,
and the interrupt restrictions into the next rather than mix them.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-18-npiggin@gmail.com
This simplifies code, and it is also useful when introducing
interrupt handler wrappers when introducing wrapper functionality
that doesn't cope with asm entry code calling into more than one
handler function.
32-bit and 64e still have some such cases, which limits some ways
they can use interrupt wrappers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-15-npiggin@gmail.com
This keeps the context tracking over the entire interrupt handler which
helps later with moving context tracking into interrupt wrappers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-14-npiggin@gmail.com
This function acts like an interrupt handler so it needs to follow
the standard interrupt handler function signature which will be
introduced in a future change.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-13-npiggin@gmail.com
Similar to the previous patch this makes interrupt handler function
types more regular so they can be wrapped with the next patch.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-12-npiggin@gmail.com
Make mm fault handlers all just take the pt_regs * argument and load
DAR/DSISR from that. Make those that return a value return long.
This is done to make the function signatures match other handlers, which
will help with a future patch to add wrappers. Explicit arguments could
be added for performance but that would require more wrapper macro
variants.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-7-npiggin@gmail.com
The fault handling still has some complex logic particularly around
hash table handling, in asm. Implement most of this in C.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-6-npiggin@gmail.com
This fix the bad fault reported by KUAP when io_wqe_worker access userspace.
Bug: Read fault blocked by KUAP!
WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
..........
Call Trace:
[c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
[c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
[c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
--- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
..........
NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
interrupt: 300
[c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
[c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
[c000000016367990] [c000000000710330] iomap_file_buffered_write+0xa0/0x120
[c0000000163679e0] [c00800000040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
[c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
[c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
[c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
[c000000016367cb0] [c0000000006e2578] io_worker_handle_work+0x498/0x800
[c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
[c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
[c000000016367e10] [c00000000000dbf0] ret_from_kernel_thread+0x5c/0x6c
The kernel consider thread AMR value for kernel thread to be
AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
of course not correct and we should allow userspace access after
kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
AMR value of the operating address space. But, the AMR value is
thread-specific and we inherit the address space and not thread
access restrictions. Because of this ignore AMR value when accessing
userspace via kernel thread.
current_thread_amr/iamr() are updated, because we use them in the
below stack.
....
[ 530.710838] CPU: 13 PID: 5587 Comm: io_wqe_worker-0 Tainted: G D 5.11.0-rc6+ #3
....
NIP [c0000000000aa0c8] pkey_access_permitted+0x28/0x90
LR [c0000000004b9278] gup_pte_range+0x188/0x420
--- interrupt: 700
[c00000001c4ef3f0] [0000000000000000] 0x0 (unreliable)
[c00000001c4ef490] [c0000000004bd39c] gup_pgd_range+0x3ac/0xa20
[c00000001c4ef5a0] [c0000000004bdd44] internal_get_user_pages_fast+0x334/0x410
[c00000001c4ef620] [c000000000852028] iov_iter_get_pages+0xf8/0x5c0
[c00000001c4ef6a0] [c0000000007da44c] bio_iov_iter_get_pages+0xec/0x700
[c00000001c4ef770] [c0000000006a325c] iomap_dio_bio_actor+0x2ac/0x4f0
[c00000001c4ef810] [c00000000069cd94] iomap_apply+0x2b4/0x740
[c00000001c4ef920] [c0000000006a38b8] __iomap_dio_rw+0x238/0x5c0
[c00000001c4ef9d0] [c0000000006a3c60] iomap_dio_rw+0x20/0x80
[c00000001c4ef9f0] [c008000001927a30] xfs_file_dio_aio_write+0x1f8/0x650 [xfs]
[c00000001c4efa60] [c0080000019284dc] xfs_file_write_iter+0xc4/0x130 [xfs]
[c00000001c4efa90] [c000000000669984] io_write+0x104/0x4b0
[c00000001c4efbb0] [c00000000066cea4] io_issue_sqe+0x3d4/0xf50
[c00000001c4efc60] [c000000000670200] io_wq_submit_work+0xb0/0x2f0
[c00000001c4efcb0] [c000000000674268] io_worker_handle_work+0x248/0x4a0
[c00000001c4efd30] [c0000000006746e8] io_wqe_worker+0x228/0x2a0
[c00000001c4efda0] [c00000000019d994] kthread+0x1b4/0x1c0
Fixes: 48a8ab4eeb ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode.")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210206025634.521979-1-aneesh.kumar@linux.ibm.com
It is safe to traverse mm->context.iommu_group_mem_list with either
mem_list_mutex or the RCU read lock held. Silence a few RCU-list false
positive warnings and fix a few missing RCU read locks.
arch/powerpc/mm/book3s64/iommu_api.c:330 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by qemu-kvm/4305:
#0: c000000bc3fe4d68 (&container->lock){+.+.}-{3:3}, at: tce_iommu_ioctl.part.9+0xc7c/0x1870 [vfio_iommu_spapr_tce]
#1: c000000001501910 (mem_list_mutex){+.+.}-{3:3}, at: mm_iommu_get+0x50/0x190
====
arch/powerpc/mm/book3s64/iommu_api.c:132 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by qemu-kvm/4305:
#0: c000000bc3fe4d68 (&container->lock){+.+.}-{3:3}, at: tce_iommu_ioctl.part.9+0xc7c/0x1870 [vfio_iommu_spapr_tce]
#1: c000000001501910 (mem_list_mutex){+.+.}-{3:3}, at: mm_iommu_do_alloc+0x120/0x5f0
====
arch/powerpc/mm/book3s64/iommu_api.c:292 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by qemu-kvm/4312:
#0: c000000ecafe23c8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0xdc/0x950 [kvm]
#1: c000000045e6c468 (&kvm->srcu){....}-{0:0}, at: kvmppc_h_put_tce+0x88/0x340 [kvm]
====
arch/powerpc/mm/book3s64/iommu_api.c:424 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by qemu-kvm/4312:
#0: c000000ecafe23c8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0xdc/0x950 [kvm]
#1: c000000045e6c468 (&kvm->srcu){....}-{0:0}, at: kvmppc_h_put_tce+0x88/0x340 [kvm]
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200510051559.1959-1-cai@lca.pw
pseries_alloc_bootmem_huge_page() is only used locally in
alloc_bootmem_huge_page() and does not need to be external.
It fixes this W=1 compile error :
../arch/powerpc/mm/hugetlbpage.c:220:12: error: no previous prototype for ‘pseries_alloc_bootmem_huge_page’ [-Werror=missing-prototypes]
220 | int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-16-clg@kaod.org
It fixes this W=1 compile error :
../arch/powerpc/mm/book3s64/hash_utils.c:1867:6: error: no previous prototype for ‘hpte_insert_repeating’ [-Werror=missing-prototypes]
1867 | long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
| ^~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210104143206.695198-14-clg@kaod.org
- Switch to the generic C VDSO, as well as some cleanups of our VDSO
setup/handling code.
- Support for KUAP (Kernel User Access Prevention) on systems using the hashed
page table MMU, using memory protection keys.
- Better handling of PowerVM SMT8 systems where all threads of a core do not
share an L2, allowing the scheduler to make better scheduling decisions.
- Further improvements to our machine check handling.
- Show registers when unwinding interrupt frames during stack traces.
- Improvements to our pseries (PowerVM) partition migration code.
- Several series from Christophe refactoring and cleaning up various parts of
the 32-bit code.
- Other smaller features, fixes & cleanups.
Thanks to:
Alan Modra, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Ard
Biesheuvel, Athira Rajeev, Balamuruhan S, Bill Wendling, Cédric Le Goater,
Christophe Leroy, Christophe Lombard, Colin Ian King, Daniel Axtens, David
Hildenbrand, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geert
Uytterhoeven, Giuseppe Sacco, Greg Kurz, Harish, Jan Kratochvil, Jordan
Niethe, Kaixu Xia, Laurent Dufour, Leonardo Bras, Madhavan Srinivasan, Mahesh
Salgaonkar, Mathieu Desnoyers, Nathan Lynch, Nicholas Piggin, Oleg Nesterov,
Oliver O'Halloran, Oscar Salvador, Po-Hsu Lin, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sandipan Das, Sebastian Andrzej Siewior ,
Segher Boessenkool, Srikar Dronamraju, Tyrel Datwyler, Uwe Kleine-König,
Vincent Stehlé, Youling Tang, Zhang Xiaoxu.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/bURITHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgEzBEAC1Vwibcog2P9rkJPb0q3UGWSYSx25V
h/LwwxtM9Tm14j/LZsSgkOgIsfMaWEBIw/8D4efQ7AX9aFo+R0c2DdQMx1MG5MXz
gZk58+l3LwId6h9+OrwurpEW+ZmURLAtGMSyFdkeiZ3/XTnkbf1XnewC0QWQe56a
EGLmjx1MFl45jspoy7UIUXsXoNJIfflEKhrgUzSUh8X2eLmvB9ws6A4BXxbVzyZl
lZv3+uWimU2pFgdkB9jOCxoG4zFEr2o5ovLHG7zCCVo5JoXmTPQ5cMVBraH206ms
+5vCmu4qI8uP5UlZW/mZfhrtDiMdHdQqqFOaQwVlOmoUbU6L6E6rxm3iVnov2Bbi
iUgxoeJDxAb2cM2EWFK6oWVgr7+NkwvXM1IG8xtprhHrCdnC9r+psQr/dswb3LSg
MJ7u/RCq3uixy2kWP8E0NEHw7ngQZ/ZKPqzfnmIWOC7tYUxgaL02I8Ff9/ZXAI2J
CnmqFYOjrimHkcwXGOtKkXNvfU0DiL97qpK2AQNWElE8+bUUmpw+ltUrsdSycYmv
Afc4WIcVrTA+a9laSLgjdZbolbNSa3p+cMIYdPrVx9g+xqygbxIKv+EDGNv1WHfD
GU1gmohMY+ZkMOvFRMi8LAsEm0DH/etWE0py/8uyxDYKnGyD1Ur6452DStkmgGb2
azmcaOyLdb+HXA==
=Ga3K
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Switch to the generic C VDSO, as well as some cleanups of our VDSO
setup/handling code.
- Support for KUAP (Kernel User Access Prevention) on systems using the
hashed page table MMU, using memory protection keys.
- Better handling of PowerVM SMT8 systems where all threads of a core
do not share an L2, allowing the scheduler to make better scheduling
decisions.
- Further improvements to our machine check handling.
- Show registers when unwinding interrupt frames during stack traces.
- Improvements to our pseries (PowerVM) partition migration code.
- Several series from Christophe refactoring and cleaning up various
parts of the 32-bit code.
- Other smaller features, fixes & cleanups.
Thanks to: Alan Modra, Alexey Kardashevskiy, Andrew Donnellan, Aneesh
Kumar K.V, Ard Biesheuvel, Athira Rajeev, Balamuruhan S, Bill Wendling,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King,
Daniel Axtens, David Hildenbrand, Frederic Barrat, Ganesh Goudar,
Gautham R. Shenoy, Geert Uytterhoeven, Giuseppe Sacco, Greg Kurz,
Harish, Jan Kratochvil, Jordan Niethe, Kaixu Xia, Laurent Dufour,
Leonardo Bras, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
Desnoyers, Nathan Lynch, Nicholas Piggin, Oleg Nesterov, Oliver
O'Halloran, Oscar Salvador, Po-Hsu Lin, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sandipan Das, Sebastian Andrzej
Siewior , Segher Boessenkool, Srikar Dronamraju, Tyrel Datwyler, Uwe
Kleine-König, Vincent Stehlé, Youling Tang, and Zhang Xiaoxu.
* tag 'powerpc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (304 commits)
powerpc/32s: Fix cleanup_cpu_mmu_context() compile bug
powerpc: Add config fragment for disabling -Werror
powerpc/configs: Add ppc64le_allnoconfig target
powerpc/powernv: Rate limit opal-elog read failure message
powerpc/pseries/memhotplug: Quieten some DLPAR operations
powerpc/ps3: use dma_mapping_error()
powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
powerpc/perf: Fix Threshold Event Counter Multiplier width for P10
powerpc/mm: Fix hugetlb_free_pmd_range() and hugetlb_free_pud_range()
KVM: PPC: Book3S HV: Fix mask size for emulated msgsndp
KVM: PPC: fix comparison to bool warning
KVM: PPC: Book3S: Assign boolean values to a bool variable
powerpc: Inline setup_kup()
powerpc/64s: Mark the kuap/kuep functions non __init
KVM: PPC: Book3S HV: XIVE: Add a comment regarding VP numbering
powerpc/xive: Improve error reporting of OPAL calls
powerpc/xive: Simplify xive_do_source_eoi()
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG
...
- Consolidate all kmap_atomic() internals into a generic implementation
which builds the base for the kmap_local() API and make the
kmap_atomic() interface wrappers which handle the disabling/enabling of
preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a mapping
is established. It has to disable migration instead to guarantee that
the virtual address of the mapped slot is the same accross preemption.
- Provide better debug facilities: guard pages and enforced utilization
of the mapping mechanics on 64bit systems when the architecture allows
it.
- Provide the new kmap_local() API which can now be used to cleanup the
kmap_atomic() usage sites all over the place. Most of the usage sites
do not require the implicit disabling of preemption and pagefaults so
the penalty on 64bit and 32bit non-highmem systems is removed and quite
some of the code can be simplified. A wholesale conversion is not
possible because some usage depends on the implicit side effects and
some need to be cleaned up because they work around these side effects.
The migrate disable side effect is only effective on highmem systems
and when enforced debugging is enabled. On 64bit and 32bit non-highmem
systems the overhead is completely avoided.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
+x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
+jlfoJhMbtx5Gg==
=n71I
-----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
setup_kup() is used by both 64-bit and 32-bit code. However on 64-bit
it must not be __init, because it's used for CPU hotplug, whereas on
32-bit it should be __init because it calls setup_kuap/kuep() which
are __init.
We worked around that problem in the past by marking it __ref, see
commit 67d53f30e2 ("powerpc/mm: fix section mismatch for
setup_kup()").
Marking it __ref basically just omits it from section mismatch
checking, which can lead to bugs, and in fact it did, see commit
44b4c4450f ("powerpc/64s: Mark the kuap/kuep functions non __init")
We can avoid all these problems by just making it static inline.
Because all it does is call other functions, making it inline actually
shrinks the 32-bit vmlinux by ~76 bytes.
Make it __always_inline as pointed out by Christophe.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201214123011.311024-1-mpe@ellerman.id.au
The kernel calls these functions on CPU online and hence they must not
be marked __init.
Otherwise if the memory they occupied has been reused the system can
crash in various ways. Sachin reported it caused his LPAR to
spontaneously restart with no other output. With xmon enabled it may
drop into xmon with a dump like:
cpu 0x1: Vector: 700 (Program Check) at [c000000003c5fcb0]
pc: 00000000011e0a78
lr: 00000000011c51d4
sp: c000000003c5ff50
msr: 8000000000081001
current = 0xc000000002c12b00
paca = 0xc000000003cff280 irqmask: 0x03 irq_happened: 0x01
pid = 0, comm = swapper/1
...
[c000000003c5ff50] 0000000000087c38 (unreliable)
[c000000003c5ff70] 000000000003870c
[c000000003c5ff90] 000000000000d108
Fixes: 3b47b7549e ("powerpc/book3s64/kuap: Move KUAP related function outside radix")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Expand change log with details and xmon output]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201214080121.358567-1-aneesh.kumar@linux.ibm.com
One commit to implement copy_from_kernel_nofault_allowed(), otherwise
copy_from_kernel_nofault() can trigger warnings when accessing bad addresses in
some configurations.
Thanks to:
Christophe Leroy, Qian Cai.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/StwITHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgHtgD/9XePu2lenUUZDbzHVKJ/4oozNqJaYc
mM7k/53GmPyi7AAttLdQGlSB0Gv2xBSDhng7T/UOnnBKwBk7gP8J4espuGraoYkA
8q1LsAO9wwpN+oDQjFQ+s4uErildwIy73uSXByfhIESHo5VtY9ol7g+zZaTfyNhO
W/wpSzcHLmTCMoWcJfk5vLCHMDmaY1Qq7U9uNt78bwUaNXz9LVZ/UFWSe4Bt6jEM
573bgsSkbLoTV5QptDUOPpIBw1T+zahwB6dMjPzbxYa6Rws1I4QeJRNYxdvunDHP
+F2ZYK/zyFBQlojPnjXJmbqQHEtXA/l9DNyLwR9VqjAOmgZaQezTVMIV56b8ndpM
X7+AG37Nt6hqUfPz3f7L67y64VFAmAt8dFqVqUzEXBcN1KpVkS5BvBxjTUKwItwo
Fdf80iSHaHYPdYAJJzjbeGuaaKID3w9H6npTR5xCKmN9o1r+N+VoZtQumlG+t6jl
EtnPu0r6y/tPcyKixk/myAAx/8mVTQicDyIj2klheDClmNMK7NA0+QpLEBus10tl
+bhk7KdWx7mQwYRltI+v7T3+mJ2SddVpQ84KmV6q21d/QbH1fQY/SvVNRYKWYb31
s31KT9lYiW7xZ5qiA6R9YNGynvhrD61Bzr5o2dKJxvbpmFkvosVWN7waPqwozlRC
l1Xvuc/1kBesiQ==
=ERDG
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One commit to implement copy_from_kernel_nofault_allowed(), otherwise
copy_from_kernel_nofault() can trigger warnings when accessing bad
addresses in some configurations.
Thanks to Christophe Leroy and Qian Cai"
* tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix KUAP warning by providing copy_from_kernel_nofault_allowed()
PTE_FLAGS_OFFSET is defined in asm/page_32.h and used only
in hash_low.S
And PTE_FLAGS_OFFSET nullity depends on CONFIG_PTE_64BIT
Instead of tests like #if (PTE_FLAGS_OFFSET != 0), use
CONFIG_PTE_64BIT related code.
Also move the definition of PTE_FLAGS_OFFSET into hash_low.S
directly, that improves readability.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f5bc21db7a33dab55924734e6060c2e9daed562e.1606247495.git.christophe.leroy@csgroup.eu
All hugetlb range freeing functions have a verification like the following,
which only differs by the mask used, depending on the page table level.
start &= MASK;
if (start < floor)
return;
if (ceiling) {
ceiling &= MASK;
if (! ceiling)
return;
}
if (end - 1 > ceiling - 1)
return;
Refactor that into a helper function which takes the mask as
an argument, returning true when [start;end[ is not fully
contained inside [floor;ceiling[
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/16a571bb32eb6e8cd44bda484c8d81cd8a25e6d7.1604668827.git.christophe.leroy@csgroup.eu
Exception fixup doesn't require the heady full regs saving,
do it from do_page_fault() directly.
For that, split bad_page_fault() in two parts.
As bad_page_fault() can also be called from other places than
handle_page_fault(), it will still perform exception fixup and
fallback on __bad_page_fault().
handle_page_fault() directly calls __bad_page_fault() as the
exception fixup will now be done by do_page_fault()
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bd07d6fef9237614cd6d318d8f19faeeadaa816b.1607491748.git.christophe.leroy@csgroup.eu
search_exception_tables() is an heavy operation, we have to avoid it.
When KUAP is selected, we'll know the fault has been blocked by KUAP.
When it is blocked by KUAP, check whether we are in an expected
userspace access place. If so, emit a warning to spot something is
going work. Otherwise, just remain silent, it will likely Oops soon.
When KUAP is not selected, it behaves just as if the address was
already in the TLBs and no fault was generated.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9870f01e293a5a76c4f4e4ddd4a6b0f63038c591.1607491748.git.christophe.leroy@csgroup.eu
The verification and message introduced by commit 374f3f5979
("powerpc/mm/hash: Handle user access of kernel address gracefully")
applies to all platforms, it should not be limited to BOOK3S.
Make the BOOK3S version of sanity_check_fault() the one for all,
and bail out earlier if not BOOK3S.
Fixes: 374f3f5979 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/fe199d5af3578d3bf80035d203a94d742a7a28af.1607491748.git.christophe.leroy@csgroup.eu
There is no big poing in not pinning kernel text anymore, as now
we can keep pinned TLB even with things like DEBUG_PAGEALLOC.
Remove CONFIG_PIN_TLB_TEXT, making it always right.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Drop ifdef around mmu_pin_tlb() to fix build errors]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/203b89de491e1379f1677a2685211b7c32adfff0.1606231483.git.christophe.leroy@csgroup.eu
On hash 32 bits, handling minor protection faults like unsetting
dirty flag is heavy if done from the normal page_fault processing,
because it implies hash table software lookup for flushing the entry
and then a DSI is taken anyway to add the entry back.
When KUAP was implemented, as explained in commit a68c31fc01
("powerpc/32s: Implement Kernel Userspace Access Protection"),
protection faults has been diverted from hash_page() because
hash_page() was not able to identify a KUAP fault.
Implement KUAP verification in hash_page(), by clearing write
permission when the access is a kernel access and Ks is 1.
This works regardless of the address because kernel segments always
have Ks set to 0 while user segments have Ks set to 0 only
when kernel write to userspace is granted.
Then protection faults can be handled by hash_page() even for KUAP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8a4ffe4798e9ea32aaaccdf85e411bb1beed3500.1605542955.git.christophe.leroy@csgroup.eu
flush_range() handle both the MMU_FTR_HPTE_TABLE case and
the other case.
The non MMU_FTR_HPTE_TABLE case is trivial as it is only a call
to _tlbie()/_tlbia() which is not worth a dedicated function.
Make flush_range() a hash specific and call it from tlbflush.h based
on mmu_has_feature(MMU_FTR_HPTE_TABLE).
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/132ab19aae52abc8e06ab524ec86d4229b5b9c3d.1603348103.git.christophe.leroy@csgroup.eu
flush_tlb_mm() and flush_tlb_page() handle both the MMU_FTR_HPTE_TABLE
case and the other case.
The non MMU_FTR_HPTE_TABLE case is trivial as it is only a call
to _tlbie()/_tlbia() which is not worth a dedicated function.
Make flush_tlb_mm() and flush_tlb_page() hash specific and call
them from tlbflush.h based on mmu_has_feature(MMU_FTR_HPTE_TABLE).
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/11e932ded41ba6d9b251d89b7afa33cc060d3aa4.1603348103.git.christophe.leroy@csgroup.eu
_tlbie() and _tlbia() are used only on 603 cores while the
other functions are used only on cores having a hash table.
Move them into a new file named nohash_low.S
Add mmu_hash_lock var is used by both, it needs to go
in a common file.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9a265b1b17a64153463d361280cb4b43eb1266a4.1603348103.git.christophe.leroy@csgroup.eu
We now have an early hash table on hash MMU, so no need to check
Hash var to know if the Hash table is set of not.
Use mmu_has_feature(MMU_FTR_HPTE_TABLE) instead. This will allow
optimisation via jump_label.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f1766631a9e014b6433f1a3c12c726ddfce34220.1603348103.git.christophe.leroy@csgroup.eu
This partially reverts commit eb232b1624 ("powerpc/book3s64/kuap: Improve
error reporting with KUAP") and update the fault handler to print
[ 55.022514] Kernel attempted to access user page (7e6725b70000) - exploit attempt? (uid: 0)
[ 55.022528] BUG: Unable to handle kernel data access on read at 0x7e6725b70000
[ 55.022533] Faulting instruction address: 0xc000000000e8b9bc
[ 55.022540] Oops: Kernel access of bad area, sig: 11 [#1]
....
when the kernel access userspace address without unlocking AMR.
bad_kuap_fault() is added as part of commit 5e5be3aed2 ("powerpc/mm: Detect
bad KUAP faults") to catch userspace access incorrectly blocked by AMR. Hence
retain the full stack dump there even with hash translation. Also, add a comment
explaining the difference between hash and radix.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201208031539.84878-1-aneesh.kumar@linux.ibm.com
Three commits fixing possible missed TLB invalidations for multi-threaded
processes when CPUs are hotplugged in and out.
A fix for a host crash triggerable by host userspace (qemu) in KVM on Power9.
A fix for a host crash in machine check handling when running HPT guests on a
HPT host.
One commit fixing potential missed TLB invalidations when using the hash MMU on
Power9 or later.
A regression fix for machines with CPUs on node 0 but no memory.
Thanks to:
Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan Mohanty, Milton Miller,
Nicholas Piggin, Paul Mackerras, Srikar Dronamraju.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/LcwsTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgOeiD/wKGX8eE7AJ5ZxoFLwpGEJhp9QgMDhe
nP82CkKobwMM3UCbde9MC8PqYGC7/7PhRPM0GI03uh6EfeHUtle7AZlBAlZoGaeJ
MwdQBQrZSqf1QJOyhUEa6CI0XTfCEOrsw+AkZQKdsv9JLcFBz7IyfP61gf7MHfyo
QKlfYYilXHbJ7M9oiM9gKUdtrpPfMGH0YnIp0FR+JowJAWUfFY626H9j7chNwWK+
7nrphtLHwsBVNtIoKWvPocuLKPsziOqXWnOP/do/RuCoKXMbGjtOJHhUgEYC5PM7
eQug43YDaws4K1fxaHvQto/u92nL2GFY6FfKNeJ5FcQYgCIvi/T8jzEsJyqGbpVz
YihZj1MbhhGr/neVtJW4SbdCTCU7R7X9QBy4He6XoWHR0fNoQDQvjNT/ziiuHiN0
tU+Y9aoHwI/0Pb44ceiQ/T10nxYtk+6Cj5Cm9Ll7MvfjUsE/BpxlYdi+KMqRSGOb
itOwFLQpgy28feMRKGZNKFURwTophASFaKO88yhjeSnlcGqxvicSIUpz8UD1jxwt
o/tsger09ZXqBYVdVKLpqbKsifVbzUfJmmycvuDF37B+VjwHACP+VZltwdOqnX13
BM9ndcDW2p6UnNLfs47FWJM+czmShrgwqI/W7qcCFleYL3r5XOS8hJHfgvJEcE04
n7A9cNvK5q6nvg==
=tIAZ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.10:
- Three commits fixing possible missed TLB invalidations for
multi-threaded processes when CPUs are hotplugged in and out.
- A fix for a host crash triggerable by host userspace (qemu) in KVM
on Power9.
- A fix for a host crash in machine check handling when running HPT
guests on a HPT host.
- One commit fixing potential missed TLB invalidations when using the
hash MMU on Power9 or later.
- A regression fix for machines with CPUs on node 0 but no memory.
Thanks to Aneesh Kumar K.V, Cédric Le Goater, Greg Kurz, Milan
Mohanty, Milton Miller, Nicholas Piggin, Paul Mackerras, and Srikar
Dronamraju"
* tag 'powerpc-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/powernv: Fix memory corruption when saving SLB entries on MCE
KVM: PPC: Book3S HV: XIVE: Fix vCPU id sanity check
powerpc/numa: Fix a regression on memoryless node 0
powerpc/64s: Trim offlined CPUs from mm_cpumasks
kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
powerpc/64s/pseries: Fix hash tlbiel_all_isa300 for guest kernels
powerpc/64s: Fix hash ISA v3.0 TLBIEL instruction generation
There is no defconfig selecting CONFIG_E200, and no platform.
e200 is an earlier version of booke, a predecessor of e500,
with some particularities like an unified cache instead of both an
instruction cache and a data cache.
Remove it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/34ebc3ba2c768d97f363bd5f2deea2356e9ae127.1605589460.git.christophe.leroy@csgroup.eu
SPRN_SPRG_PGDIR is there mainly to speedup SW TLB miss handlers
for powerpc 603.
We need to free SPRN_SPRG2 to reduce the mess with CONFIG_VMAP_STACK.
In hash_page(), reading PGDIR from thread_struct will be in the noise
performance wise.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4adca19b7120cdf619956768ed09e74fc6a558f3.1606285014.git.christophe.leroy@csgroup.eu
Since commit 2b279c0348 ("powerpc/32s: Allow mapping with BATs with
DEBUG_PAGEALLOC"), there is no real situation where mapping without
BATs is required.
In order to simplify memory handling, always map kernel text
and rodata with BATs even when "nobats" kernel parameter is set.
Also fix the 603 TLB miss exceptions that don't require anymore
kernel page table if DEBUG_PAGEALLOC.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/da51f7ec632825a4ce43290a904aad61648408c0.1606285013.git.christophe.leroy@csgroup.eu
With hash translation use DSISR_KEYFAULT to identify a wrong access.
With Radix we look at the AMR value and type of fault.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-17-aneesh.kumar@linux.ibm.com
Now that kernel correctly store/restore userspace AMR/IAMR values, avoid
manipulating AMR and IAMR from the kernel on behalf of userspace.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-15-aneesh.kumar@linux.ibm.com
On fork, we inherit from the parent and on exec, we should switch to default_amr values.
Also, avoid changing the AMR register value within the kernel. The kernel now runs with
different AMR values.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-13-aneesh.kumar@linux.ibm.com
This patch updates kernel hash page table entries to use storage key 3
for its mapping. This implies all kernel access will now use key 3 to
control READ/WRITE. The patch also prevents the allocation of key 3 from
userspace and UAMOR value is updated such that userspace cannot modify key 3.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-9-aneesh.kumar@linux.ibm.com
This is in preparation to adding support for kuap with hash translation.
In preparation for that rename/move kuap related functions to
non radix names. Also move the feature bit closer to MMU_FTR_KUEP.
MMU_FTR_KUEP is renamed to MMU_FTR_BOOK3S_KUEP to indicate the feature
is only relevant to BOOK3S_64
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-8-aneesh.kumar@linux.ibm.com
The next set of patches adds support for kuep with hash translation.
In preparation for that rename/move kuap related functions to
non radix names.
Also set MMU_FTR_KUEP and add the missing isync().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-7-aneesh.kumar@linux.ibm.com
The next set of patches adds support for kuap with hash translation.
In preparation for that rename/move kuap related functions to
non radix names.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-6-aneesh.kumar@linux.ibm.com
This patch consolidates UAMOR update across pkey, kuap and kuep features.
The boot cpu initialize UAMOR via pkey init and both radix/hash do the
secondary cpu UAMOR init in early_init_mmu_secondary.
We don't check for mmu_feature in radix secondary init because UAMOR
is a supported SPRN with all CPUs supporting radix translation.
The old code was not updating UAMOR if we had smap disabled and smep enabled.
This change handles that case.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-5-aneesh.kumar@linux.ibm.com
The config CONFIG_PPC_PKEY is used to select the base support that is
required for PPC_MEM_KEYS, KUAP, and KUEP. Adding this dependency
reduces the code complexity(in terms of #ifdefs) and enables us to
move some of the initialization code to pkeys.c
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201127044424.40686-4-aneesh.kumar@linux.ibm.com
Since ISA v3.0, SLB no longer uses the slb_cache, and stab_rr is no
longer correlated with SLB allocation. Move those to pre-3.0.
While here, improve some alignments and reduce whitespace.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201128070728.825934-9-npiggin@gmail.com
When offlining a CPU, powerpc/64s does not flush TLBs, rather it just
leaves the CPU set in mm_cpumasks, so it continues to receive TLBIEs
to manage its TLBs.
However the exit_flush_lazy_tlbs() function expects that after
returning, all CPUs (except self) have flushed TLBs for that mm, in
which case TLBIEL can be used for this flush. This breaks for offline
CPUs because they don't get the IPI to flush their TLB. This can lead
to stale translations.
Fix this by clearing the CPU from mm_cpumasks, then flushing all TLBs
before going offline.
These offlined CPU bits stuck in the cpumask also prevents the cpumask
from being trimmed back to local mode, which means continual broadcast
IPIs or TLBIEs are needed for TLB flushing. This patch prevents that
situation too.
A cast of many were involved in working this out, but in particular
Milton, Aneesh, Paul made key discoveries.
Fixes: 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Debugged-by: Milton Miller <miltonm@us.ibm.com>
Debugged-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Debugged-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-5-npiggin@gmail.com
tlbiel_all() can not be usable in !HVMODE when running hash presently,
remove HV privileged flushes when running in guest to make it usable.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-3-npiggin@gmail.com
A typo has the R field of the instruction assigned by lucky dip a la
register allocator.
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-2-npiggin@gmail.com
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: a035b6bf86 ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's revert what we did in case something goes wrong and we return an
error - as already done on arm64 and s390x.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111145322.15793-8-david@redhat.com
The single caller (arch_remove_linear_mapping()) prints a proper
warning when this function fails. No need to eventually crash the
kernel - let's drop this WARN_ON.
Suggested-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111145322.15793-7-david@redhat.com
Let's print a warning similar to in arch_add_linear_mapping() instead of
WARN_ON_ONCE() and eventually crashing the kernel.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111145322.15793-6-david@redhat.com
This code currently relies on mem_hotplug_begin()/mem_hotplug_done() -
create_section_mapping()/remove_section_mapping() implementations
cannot tollerate getting called concurrently.
Let's prepare for callers (memtrace) not holding any such locks (and
don't force them to mess with memory hotplug locks).
Other parts in these functions don't seem to rely on external locking.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111145322.15793-5-david@redhat.com
We want to stop abusing memory hotplug infrastructure in memtrace code
to perform allocations and remove the linear mapping. Instead we will use
alloc_contig_pages() and remove the linear mapping manually.
Let's factor out creating/removing the linear mapping into
arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
future, we might be able to have whole arch_add_memory() /
arch_remove_memory() be implemented in common code.
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201111145322.15793-4-david@redhat.com
With POWER10, single tlbiel instruction invalidates all the congruence
class of the TLB and hence we need to issue only one tlbiel with SET=0.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007053305.232879-1-aneesh.kumar@linux.ibm.com
powerpc used to set the PTE specific flags in set_pte_at(). That is
different from other architectures. To be consistent with other
architectures powerpc updated pfn_pte() to set _PAGE_PTE in commit
379c926d63 ("powerpc/mm: move setting pte specific flags to
pfn_pte")
That commit didn't do the same for pfn_pmd() because we expect
pmd_mkhuge() to do that. But as per Linus that is a bad rule:
The rule that you must use "pmd_mkhuge()" seems _completely_ wrong.
The only valid use to ever make a pmd out of a pfn is to make a
huge-page.
Hence update pfn_pmd() to set _PAGE_PTE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201022091115.39568-1-aneesh.kumar@linux.ibm.com
No reason having the same code in every architecture
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20201103095858.087635810@linutronix.de
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting it for
powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for detecting ISA
v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal Power9
systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about the
hardware topology on Power9/10, where two SMT4 cores can be presented by
firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(), to
prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Athira Rajeev, Biwen
Li, Cameron Berkenpas, Cédric Le Goater, Christophe Leroy, Christoph Hellwig,
Colin Ian King, Daniel Axtens, David Dai, Finn Thain, Frederic Barrat, Gautham
R. Shenoy, Greg Kurz, Gustavo Romero, Ira Weiny, Jason Yan, Joel Stanley,
Jordan Niethe, Kajol Jain, Konrad Rzeszutek Wilk, Laurent Dufour, Leonardo
Bras, Liu Shixin, Luca Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar,
Nathan Lynch, Nicholas Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver
O'Halloran, Pedro Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai,
Qinglang Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott
Cheloha, Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl+JBQoTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgJJAD/0e3tsFP+9rFlxKSJlDcMW3w7kXDRXE
tG40F1ubYFLU8wtFVR0De3njTRsz5HyaNU6SI8CwPq48mCa7OFn1D1OeHonHXDX9
w6v3GE2S1uXXQnjm+czcfdjWQut0IwWBLx007/S23WcPff3Abc2irupKLNu+Gx29
b/yxJHZSRJVX59jSV94HkdJS75mDHQ3oUOlFGXtuGcUZDufpD1ynRcQOjr0V/8JU
F4WAblFSe7hiczHGqIvfhFVJ+OikEhnj2aEMAL8U7vxzrAZ7RErKCN9s/0Tf0Ktx
FzNEFNLHZGqh+qNDpKKmM+RnaeO2Lcoc9qVn7vMHOsXPzx9F5LJwkI/DgPjtgAq/
mFvGnQB/FapATnQeMluViC/qhEe5bQXLUfPP5i2+QOjK0QqwyFlUMgaVNfsY8jRW
0Q/sNA72Opzst4WUTveCd4SOInlUuat09e5nLooCRLW7u7/jIiXNRSFNvpOiwkfF
EcIPJsi6FUQ4SNbqpRSNEO9fK5JZrrUtmr0pg8I7fZhHYGcxEjqPR6IWCs3DTsak
4/KhjhhTnP/IWJRw6qKAyNhEyEwpWqYZ97SIQbvSb1g/bS47AIdQdJRb0eEoRjhx
sbbnnYFwPFkG4c1yQSIFanT9wNDQ2hFx/c/mRfbd7J+ordx9JsoqXjqrGuhsU/pH
GttJLmkJ5FH+pQ==
=akeX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
powerpc used to set the pte specific flags in set_pte_at(). This is
different from other architectures. To be consistent with other
architecture update pfn_pte to set _PAGE_PTE on ppc64. Also, drop now
unused pte_mkpte.
We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without
setting _PAGE_PTE bit. We will remove that after a few releases.
With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
_PAGE_PTE bit.
[akpm@linux-foundation.org: whitespace fix, per Christophe]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20200902114222.181353-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- rework the non-coherent DMA allocator
- move private definitions out of <linux/dma-mapping.h>
- lower CMA_ALIGNMENT (Paul Cercueil)
- remove the omap1 dma address translation in favor of the common
code
- make dma-direct aware of multiple dma offset ranges (Jim Quinlan)
- support per-node DMA CMA areas (Barry Song)
- increase the default seg boundary limit (Nicolin Chen)
- misc fixes (Robin Murphy, Thomas Tai, Xu Wang)
- various cleanups
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl+IiPwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPKEQ//TM8vxjucnRl/pklpMin49dJorwiVvROLhQqLmdxw
286ZKpVzYYAPc7LnNqwIBugnFZiXuHu8xPKQkIiOa2OtNDTwhKNoBxOAmOJaV6DD
8JfEtZYeX5mKJ/Nqd2iSkIqOvCwZ9Wzii+aytJ2U88wezQr1fnyF4X49MegETEey
FHWreSaRWZKa0MMRu9AQ0QxmoNTHAQUNaPc0PeqEtPULybfkGOGw4/ghSB7WcKrA
gtKTuooNOSpVEHkTas2TMpcBp6lxtOjFqKzVN0ml+/nqq5NeTSDx91VOCX/6Cj76
mXIg+s7fbACTk/BmkkwAkd0QEw4fo4tyD6Bep/5QNhvEoAriTuSRbhvLdOwFz0EF
vhkF0Rer6umdhSK7nPd7SBqn8kAnP4vBbdmB68+nc3lmkqysLyE4VkgkdH/IYYQI
6TJ0oilXWFmU6DT5Rm4FBqCvfcEfU2dUIHJr5wZHqrF2kLzoZ+mpg42fADoG4GuI
D/oOsz7soeaRe3eYfWybC0omGR6YYPozZJ9lsfftcElmwSsFrmPsbO1DM5IBkj1B
gItmEbOB9ZK3RhIK55T/3u1UWY3Uc/RVr+kchWvADGrWnRQnW0kxYIqDgiOytLFi
JZNH8uHpJIwzoJAv6XXSPyEUBwXTG+zK37Ce769HGbUEaUrE71MxBbQAQsK8mDpg
7fM=
=Bkf/
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- rework the non-coherent DMA allocator
- move private definitions out of <linux/dma-mapping.h>
- lower CMA_ALIGNMENT (Paul Cercueil)
- remove the omap1 dma address translation in favor of the common code
- make dma-direct aware of multiple dma offset ranges (Jim Quinlan)
- support per-node DMA CMA areas (Barry Song)
- increase the default seg boundary limit (Nicolin Chen)
- misc fixes (Robin Murphy, Thomas Tai, Xu Wang)
- various cleanups
* tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
ARM/ixp4xx: add a missing include of dma-map-ops.h
dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
dma-direct: factor out a dma_direct_alloc_from_pool helper
dma-direct check for highmem pages in dma_direct_alloc_pages
dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>
dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma
dma-mapping: move dma-debug.h to kernel/dma/
dma-mapping: remove <asm/dma-contiguous.h>
dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>
dma-contiguous: remove dma_contiguous_set_default
dma-contiguous: remove dev_set_cma_area
dma-contiguous: remove dma_declare_contiguous
dma-mapping: split <linux/dma-mapping.h>
cma: decrease CMA_ALIGNMENT lower limit to 2
firewire-ohci: use dma_alloc_pages
dma-iommu: implement ->alloc_noncoherent
dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
dma-mapping: add a new dma_alloc_pages API
dma-mapping: remove dma_cache_sync
53c700: convert to dma_alloc_noncoherent
...
There are several occurrences of the following pattern:
for_each_memblock(memory, reg) {
start = __pfn_to_phys(memblock_region_memory_base_pfn(reg);
end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));
/* do something with start and end */
}
Using for_each_mem_range() iterator is more appropriate in such cases and
allows simpler and cleaner code.
[akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build]
[rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring]
Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several occurrences of the following pattern:
for_each_memblock(memory, reg) {
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
/* do something with start_pfn and end_pfn */
}
Rather than iterate over all memblock.memory regions and each time query
for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
simpler and clearer code.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At the time being, an early hash table is set up when
CONFIG_KASAN is selected.
There is nothing wrong with setting such an early hash table
all the time, even if it is not used. This is a statically
allocated 256 kB table which lies in the init data section.
This makes the code simpler and may in the future allow to
setup early IO mappings with fixmap instead of hard coding BATs.
Put create_hpte() and flush_hash_pages() in the .ref.text section
in order to avoid warning for the reference to early_hash[]. This
reference is removed by MMU_init_hw_patch() before init memory is
freed.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b8f8101c368b8a6451844a58d7bd7d83c14cf2aa.1601566529.git.christophe.leroy@csgroup.eu
PowerPC 601 has been retired.
Remove all associated specific code.
CPU_FTRS_PPC601 has CPU_FTR_COHERENT_ICACHE and CPU_FTR_COMMON.
CPU_FTR_COMMON is already present via other CPU_FTRS.
None of the remaining CPU selects CPU_FTR_COHERENT_ICACHE.
So CPU_FTRS_PPC601 can be removed from the possible features,
hence can be removed completely.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/60b725d55e21beec3335175c20b77903ff98284f.1601362098.git.christophe.leroy@csgroup.eu
Similar to commit 89c140bbae ("pseries: Fix 64 bit logical memory block panic")
make sure different variables tracking lmb_size are updated to be 64 bit.
Fixes: af9d00e93a ("powerpc/mm/radix: Create separate mappings for hot-plugged memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201007114836.282468-4-aneesh.kumar@linux.ibm.com
During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().
This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid. In dlpar_add_lmb() we get this address from the LMB itself.
In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.
If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup. The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.
Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.
Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:
Unpatched (12450 seconds):
Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
Patched (7065 seconds):
Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range. This is because we
are skipping a linear LMB search.
To see the distinction, consider smaller hot-add test on the same
LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:
Unpatched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
4,708 context-switches # 0.045 K/sec ( +- 0.69% )
2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
394 page-faults # 0.004 K/sec ( +- 0.22% )
445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
258,091,488,691 instructions # 0.58 insn per cycle
# 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)
105.583 +- 0.589 seconds time elapsed ( +- 0.56% )
Patched:
Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
4,606 context-switches # 0.044 K/sec ( +- 0.20% )
2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
394 page-faults # 0.004 K/sec ( +- 0.25% )
442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
252,731,168,193 instructions # 0.57 insn per cycle
# 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)
104.829 +- 0.325 seconds time elapsed ( +- 0.31% )
This is consistent. An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first. On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.
Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.
TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
cp_abort to the tlbiel sequence.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup whitespace and comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com
Lookup the coregroup id from the associativity array.
If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.
Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.
If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.
Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract. Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.
This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.
By marking, node 0 as offline, lets stop assuming that node 0 is
always online. If node 0 has CPU or memory that are online, node 0 will
again be set as online.
v5.8
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10
proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
v5.8 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10
proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
Example of a node with online CPUs/memory on node 0.
(Same o/p with and without patch)
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 32482 MB
node 0 free: 22994 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10
Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.
1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.
2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump
3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.
To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" available at:
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
... there are 2 device tree properties:
"ibm,max-associativity-domains"
which defines the maximum number of domains that the firmware i.e
PowerVM can support.
and:
"ibm,current-associativity-domains"
which defines the maximum number of domains that the current
platform can support.
The value of "ibm,max-associativity-domains" is always greater than or
equal to "ibm,current-associativity-domains" property. If the latter
property is not available, use "ibm,max-associativity-domain" as a
fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
is mentioned in page 833 / B.5.3 which is covered under under
"Appendix B. System Binding" section
Currently powerpc uses the "ibm,max-associativity-domains" property
while setting the possible number of nodes. This is currently set at
32. However the possible number of nodes for a platform may be
significantly less. Hence set the possible number of nodes based on
"ibm,current-associativity-domains" property.
Nathan Lynch had raised a valid concern that post LPM (Live Partition
Migration), a user could DLPAR add processors and memory after LPM
with "new" associativity properties:
https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
He also pointed out that "ibm,max-associativity-domains" has the same
contents on all currently available PowerVM systems, unlike
"ibm,current-associativity-domains" and hence may be better able to
handle the new NUMA associativity properties.
However with the recent commit dbce456280 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.
$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
00000005 00000001 00000008 00000020 00000020 00000100
$ cat /sys/devices/system/node/possible ##Before patch
0-31
$ cat /sys/devices/system/node/possible ##After patch
0-1
Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.
This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.
I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).
Before patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 4106816 518820608 22272 570752 516606720
Swap: 4194240 0 4194240
free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4628416 518246464 22336 623296 516058688
Swap: 4194240 0 4194240
free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4697408 518173760 22400 627008 515987904
Swap: 4194240 0 4194240
After patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 3969472 518933888 22272 594816 516731776
Swap: 4194240 0 4194240
free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4181888 518676096 22208 640192 516496448
Swap: 4194240 0 4194240
free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4232320 518619904 22272 645952 516443264
Swap: 4194240 0 4194240
Observations:
Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Reformat change log a bit for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
Commit 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.
That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.
Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.
This relies on commit 38cf307c1f ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.
Fixes: 0cef77c779 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: 38cf307c1f ("mm: fix kthread_use_mm() vs TLB invalidate")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com