KASAN requires early activation of the hash table, before the memblock
functions are available.
This patch implements an early hash_table statically defined in
__initdata.
During early boot, a single page table is used.
For hash32, when doing the final init, one page table is allocated
for each PGD entry because of the _PAGE_HASHPTE flag which can't be
shared between several virtual pages. This is done after memblock becomes
available but before switching to the final hash table, otherwise
there are issues with TLB flushing due to the shared entries.
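A rough sketch of the approach (the names, table size and SDR1 encoding below are illustrative only, not the actual kernel symbols):

    #include <linux/init.h>
    #include <asm/reg.h>
    #include <asm/page.h>

    #define EARLY_HASH_SIZE    (256 << 10)    /* hypothetical 256kB table */

    /* Static table, usable before memblock, freed with the rest of initdata. */
    static char early_hash[EARLY_HASH_SIZE] __initdata __aligned(EARLY_HASH_SIZE);

    static void __init early_hash_init(void)
    {
        unsigned long ht = __pa(early_hash);

        /* HTABMASK for a 256kB table is 3; point SDR1 at the static table. */
        mtspr(SPRN_SDR1, ht | ((EARLY_HASH_SIZE >> 16) - 1));
    }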
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For KASAN, hash table handling will be activated early for
accessing the KASAN shadow areas.
In order to avoid any modification of the hash functions while
they are still used with the early hash table, the code patching
is moved out of MMU_init_hw() and put close to the big-bang switch
to the final hash table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds KASAN support for PPC32. The following patch
will add an early activation of the hash table for book3s. Until
then, a warning will be raised if trying to use KASAN on a
hash 6xx.
To support KASAN, this patch initialises the MMU mappings for
accessing the KASAN shadow area defined in a previous patch.
An early mapping is set up as soon as the kernel code has been
relocated to its final location.
Then the final mapping is set up once paging is initialised.
For modules, the shadow area is allocated at module_alloc().
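For reference, generic KASAN computes the shadow address of a kernel address with the generic helper below (as found in include/linux/kasan.h); the mappings set up here must cover that shadow range for the whole kernel address space:

    static inline void *kasan_mem_to_shadow(const void *addr)
    {
        return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
            + KASAN_SHADOW_OFFSET;
    }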
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All files containing functions run before kasan_early_init() is called
must have KASAN instrumentation disabled.
For those files, branch profiling also has to be disabled, otherwise
each if () generates a call to ftrace_likely_update().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit 400c47d81c ("powerpc32: memset: only use dcbz once cache is
enabled"), memset() can be used before activation of the cache,
so there is no need to use memset_io() for zeroing the BSS.
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In kernel/cputable.c, explicitly use memcpy() instead of *y = *x.
This allows GCC to replace it with __memcpy() when KASAN is
selected.
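The change has this shape (hypothetical helper name, not the exact cputable.c hunk):

    static void copy_cpu_spec(struct cpu_spec *t, const struct cpu_spec *s)
    {
        /* Was "*t = *s;": a struct copy the compiler may lower to memcpy(). */
        memcpy(t, s, sizeof(*t));
    }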
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When KASAN is active, the string functions in lib/ perform the
KASAN checks. This is too early for prom_init.
This patch implements dedicated string functions for prom_init,
which are compiled with KASAN disabled.
Size of prom_init before the patch:
text data bss dec hex filename
12060 488 6960 19508 4c34 arch/powerpc/kernel/prom_init.o
Size of prom_init after the patch:
text data bss dec hex filename
12460 488 6960 19908 4dc4 arch/powerpc/kernel/prom_init.o
This increases the size of prom_init a bit, but as prom_init is
in the __init section, it is freed after boot anyway.
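As an illustration, such a helper is as simple as the sketch below (the actual helpers in prom_init.c may differ):

    static size_t __init prom_strlen(const char *s)
    {
        size_t len = 0;

        while (*s++)
            len++;
        return len;
    }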
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes CONFIG_CMDLINE defined at all times. It avoids
having to enclose related code inside #ifdef CONFIG_CMDLINE.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_KASAN implements wrappers for memcpy(), memmove() and memset().
Those wrappers perform the verification and then call __memcpy(),
__memmove() and __memset() respectively. The arches are therefore
expected to rename their optimised functions accordingly.
For files on which KASAN is inhibited, #defines are used to allow
them to call the optimised versions of the functions directly, without
going through the KASAN wrappers.
See commit 393f203f5f ("x86_64: kasan: add interceptors for
memset/memmove/memcpy functions") for details.
Other string / mem functions do not (yet) have KASAN wrappers, so we
have to fall back to the generic versions when KASAN is active,
otherwise the KASAN checks would be skipped.
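The resulting pattern in the arch string header looks roughly like this (a sketch, not the exact hunk):

    #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
    /*
     * This file is built without KASAN instrumentation: bypass the checking
     * wrappers and call the optimised routines directly.
     */
    #define memcpy(dst, src, len)   __memcpy(dst, src, len)
    #define memmove(dst, src, len)  __memmove(dst, src, len)
    #define memset(p, v, len)       __memset(p, v, len)
    #endif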
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Fixups to keep selftests working]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation for KASAN, move early_init() into a separate
file in order to allow deactivation of KASAN for that function.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
No need to have this in asm/page.h, move it into asm/hugetlb.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 67fda38f0d ("powerpc/mm: Move slb_addr_limit to
early_init_mmu") moved the slb_addr_limit init out of setup_arch().
Commit 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t
for radix") brought it back into setup_arch() by mistake.
This patch reverts that erroneous regression.
Fixes: 701101865f ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a kernel crash that happens if rt_sigreturn() is called inside
a transactional block.
This crash happens if the kernel hits an in-kernel page fault when
accessing userspace memory, usually through copy_ckvsx_to_user(). A
major page fault calls the might_sleep() function, which can cause a
task reschedule. A task reschedule (switch_to()) reclaims and
recheckpoints the TM state, but, in the signal return path, the
checkpointed memory was already reclaimed, so the exception stack
holds an MSR with MSR[TS]=0.
When the code returns from might_sleep() after a task reschedule has
happened, the task comes back with the memory recheckpointed and
CPU MSR[TS] = suspended.
This means that there is a side effect of might_sleep() if it is
called with CPU MSR[TS] = 0 while the task has regs->msr[TS] != 0.
This side effect can cause a TM Bad Thing, since at the exception
entrance the stack saves MSR[TS]=0, and this is what will be used at
RFID, but the processor has MSR[TS] = Suspended; this transition is
invalid and a TM Bad Thing exception is raised, causing the
following crash:
Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033
cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70]
pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc
lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0
sp: c0000004263ebc40
msr: 8000000302a03031
current = 0xc000000415050300
paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01
pid = 25006, comm = sigfuz
Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
WARNING: exception is not recoverable, can't continue
enter ? for help
[c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
[c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430
[c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at 00007fffbaac400c
SP (7fffeca90f40) is in userspace
The solution for this problem is to run the sigreturn code with
regs->msr[TS] disabled, thus avoiding the side effect above.
This does not seem to be a problem since regs->msr will be replaced by
the ucontext value anyway, so it is being flushed already; in this
case, it is simply flushed earlier.
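In rough terms the fix amounts to something of this shape early in the 64-bit sigreturn path (an illustrative fragment with a hypothetical helper name, not the exact diff):

    static inline void hypothetical_clear_ts(struct pt_regs *regs)
    {
        /*
         * regs->msr will be replaced by the ucontext value anyway, so this
         * only flushes the transactional state earlier, before an in-kernel
         * user access can fault and reschedule.
         */
        regs->msr &= ~MSR_TS_MASK;
    }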
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Print more information about an MCE error: whether it is a hardware or
a software error.
Some MCE errors can be easily categorized as hardware or software
errors, e.g. UEs are due to hardware errors, whereas an error
triggered by invalid usage of tlbie is a pure software bug. But
not all MCE errors can be easily categorized as either software
or hardware. There are errors like multihit errors which are usually
the result of a software bug, but in some rare cases a hardware failure
can cause a multihit error. In the past, we have seen a case where,
after replacing a faulty chip, multihit errors stopped occurring. The
same goes for parity errors, which are usually due to faulty hardware,
but there is a chance that a multihit can also cause a parity error.
For such errors it is difficult to determine what really caused them.
Hence this patch classifies MCE errors into the following four
categories (a rough C sketch follows the list):
1. Hardware error:
UE and Link timeout failure errors.
2. Probable hardware error (some chance of software cause)
SLB/ERAT/TLB Parity errors.
3. Software error
Invalid tlbie form.
4. Probable software error (some chance of hardware cause)
SLB/ERAT/TLB Multihit errors.
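A rough sketch of the classification as a C enum (the names are illustrative, not necessarily those added by the patch):

    enum mce_error_class {
        MCE_ECLASS_HARDWARE,           /* e.g. UE, link timeout */
        MCE_ECLASS_HARD_INDETERMINATE, /* probable hardware, e.g. parity */
        MCE_ECLASS_SOFTWARE,           /* e.g. invalid tlbie form */
        MCE_ECLASS_SOFT_INDETERMINATE, /* probable software, e.g. multihit */
    };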
Sample output:
MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
MCE: CPU80: Probable Software error (some chance of hardware cause)
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently all machine check errors are printed as severe errors, which
isn't correct. Print soft errors as warnings instead of severe errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When analysing sources of OS jitter, I noticed that doorbells cannot be
traced.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit 2bf1071a8d ("powerpc/64s: Remove POWER9 DD1 support") the
function __switch_to() dropped its use of 'dummy_copy_buffer'. Since it
is not used anywhere else, remove it completely.
This removes the following warning:
arch/powerpc/kernel/process.c:1156:17: error: 'dummy_copy_buffer' defined but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently an error return from kobject_init_and_add() is not followed by
a call to kobject_put(). This means there is a memory leak.
Add a call to kobject_put() in the error path of kobject_init_and_add().
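The fix follows the usual kobject pattern, something like the sketch below (hypothetical function name, not the exact hunk):

    #include <linux/kobject.h>

    static int hypothetical_add_kobj(struct kobject *kobj, struct kobj_type *ktype,
                                     struct kobject *parent, const char *name)
    {
        int err = kobject_init_and_add(kobj, ktype, parent, "%s", name);

        if (err)
            /* init_and_add takes a reference even on failure: drop it. */
            kobject_put(kobj);

        return err;
    }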
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Towards the goal of removing cc-ldoption, it seems that --hash-style=
was added to binutils 2.17.50.0.2 in 2006. The minimum required
version of binutils for the kernel according to
Documentation/process/changes.rst is 2.20.
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reimplement Book3S idle code in C, moving the POWER7/8/9
implementation-specific HV idle code to the powernv platform code.
Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and to restore the stack, reload the GPRs and
return to C after waking from idle.
The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.
This is not a strict translation to C code; there are some
significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
or change MSR is restored, because it's now simple to do. ESL=1
sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style
rather than branchy. nap_state_lost is not required because KVM
always returns via NVGPR restoring path.
- KVM secondary wakeup from offline sequence is moved entirely into
the offline wakeup, which avoids a hwsync in the normal idle wakeup
path.
Performance, measured with context-switch ping-pong on different
threads or cores, is possibly improved by a small amount, 1-3%
depending on stop state and on core vs thread test for shallow states.
For deep states it's in the noise compared with other latencies.
KVM improvements:
- Idle sleepers now always return to caller rather than branch out
to KVM first.
- This allows optimisations like very fast return to caller when no
state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR
save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the
normal host idle code and into the not-performance-critical offline
code.
- KVM nap code now returns from where it is called, which makes the
flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.
Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
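A minimal sketch of an hrtimer-based heartbeat (the names and the 1s period are illustrative, not the actual watchdog code):

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static enum hrtimer_restart wd_heartbeat_fn(struct hrtimer *t)
    {
        /* ... touch this CPU's heartbeat, check the other CPUs ... */
        hrtimer_forward_now(t, ms_to_ktime(1000));
        return HRTIMER_RESTART;
    }

    static void wd_heartbeat_start(struct hrtimer *t)
    {
        /* CLOCK_MONOTONIC hrtimers keep firing even if jiffies stalls. */
        hrtimer_init(t, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
        t->function = wd_heartbeat_fn;
        hrtimer_start(t, ms_to_ktime(1000), HRTIMER_MODE_REL_PINNED);
    }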
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reported-by: Ravikumar Bangoria <ravi.bangoria@in.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reported-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This helps in debugging. We can look at dmesg to find out the
different kernel mapping details.
On 4K config this shows
kernel vmalloc start = 0xc000100000000000
kernel IO start = 0xc000200000000000
kernel vmemmap start = 0xc000300000000000
On 64K config:
kernel vmalloc start = 0xc008000000000000
kernel IO start = 0xc00a000000000000
kernel vmemmap start = 0xc00c000000000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, our mm_context_t on book3s64 includes all hash-specific
context details like the slice mask and subpage protection details. We
can skip allocating these with radix translation. This will help us
save 8K per mm_context with radix translation.
With the patch applied we have
sizeof(mm_context_t) = 136
sizeof(struct hash_mm_context) = 8288
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Avoid #ifdefs in generic code. This also enables us to make this
specific to the MMU translation mode on book3s64.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We want to switch to allocating them at runtime, only when hash
translation is enabled. Add helpers so that both book3s and nohash can
easily be adapted to the upcoming change.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor's page protection capabilities,
the protection is only against writing; read protection cannot be
achieved using page protection.
The previous patch modifies the page protection so that RW user
pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for
both user and kernel.
This patch changes the userspace segment registers to be set to Ku 0
and Ks 1. When the kernel needs to write to RW pages, the associated
segment register is then changed to Ks 0 in order to allow the kernel
write access.
In order to avoid having to read all the segment registers when
locking/unlocking the access, some data is kept in the thread_struct
and saved on the stack on exceptions. The field identifies both the
first unlocked segment and the first segment following the last
unlocked one. When no segment is unlocked, it contains value 0.
As the hash_page() function is not able to easily determine if a
protfault is due to a bad kernel access to userspace, protfaults
need to be handled by handle_page_fault when KUAP is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h,
and adapt allow_user_access() to do nothing when to == NULL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch prepares Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor's page protection capabilities,
the protection is only against writing; read protection cannot be
achieved using page protection.
book3s/32 provides the following values for PP bits:
PP00 provides RW for Key 0 and NA for Key 1
PP01 provides RW for Key 0 and RO for Key 1
PP10 provides RW for all
PP11 provides RO for all
Today PP10 is used for RW pages and PP11 for RO pages, and the user
segment registers' Kp and Ks bits are set to 1. This patch modifies
the page protection to use PP01 for RW pages and sets the user segment
registers to Kp 0 and Ks 0.
This will allow userspace write access protection to be set up by
setting Ks to 1 in the following patch.
Kernel space segment registers remain unchanged.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To implement Kernel Userspace Execution Prevention, this patch
sets the NX bit on all user segments on kernel entry and clears the
NX bit on all user segments on kernel exit.
Note that the powerpc 601 doesn't have the NX bit, so KUEP will not
work on it. A warning is displayed at startup.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds ASM macros for saving, restoring and checking
the KUAP state, and modifies setup_32 to call them on exceptions
from the kernel.
The macros are defined as empty by default, for when CONFIG_PPC_KUAP
is not selected and/or for platforms which don't handle KUAP (yet).
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Syscalls come from userspace only, so we can account time without
checking whether we are returning to kernel or user, as it will only
ever be user.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Kernel Userspace Access Prevention utilises a feature of the Radix MMU
which disallows read and write access to userspace addresses. By
utilising this, the kernel is prevented from accessing user data from
outside of trusted paths that perform proper safety checks, such as
copy_{to/from}_user() and friends.
Userspace access is disabled from early boot and is only enabled when
performing an operation like copy_{to/from}_user(). The register that
controls this (AMR) does not prevent userspace from accessing itself,
so there is no need to save and restore when entering and exiting
userspace.
When entering the kernel from the kernel we save AMR and if it is not
blocking user access (because eg. we faulted doing a user access) we
reblock user access for the duration of the exception (ie. the page
fault) and then restore the AMR when returning back to the kernel.
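In rough terms, blocking and unblocking user access is just an AMR write (a simplified sketch with hypothetical names; the real helpers also handle the read-only/write-only cases and the CONFIG_PPC_KUAP_DEBUG checks):

    static inline void hypothetical_set_kuap(unsigned long amr)
    {
        /* A context synchronising instruction is needed around the AMR write. */
        isync();
        mtspr(SPRN_AMR, amr);
        isync();
    }

    /* e.g. hypothetical_set_kuap(AMR_KUAP_BLOCKED) to block kernel access to
     * user memory, and hypothetical_set_kuap(0) around copy_{to/from}_user(). */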
This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y)
and performing the following:
# (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT
If enabled, this should send SIGSEGV to the thread.
We also add paranoid checking of AMR in switch and syscall return
under CONFIG_PPC_KUAP_DEBUG.
Co-authored-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some platforms (i.e. Radix MMU) need per-CPU initialisation for KUP.
Any platforms that only want to do KUP initialisation once
globally can just check to see if they're running on the boot CPU, or
check if whatever setup they need has already been performed.
Note that this is only for 64-bit.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements a framework for Kernel Userspace Access
Protection.
Subarches will then have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().
Some platforms will need to know the area accessed and whether it is
accessed for read, write or both. Therefore source, destination and
size are handed over to the two functions.
mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().
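A sketch of the default (no-op) hooks a subarch overrides, following the naming above (signatures illustrative, not the exact header):

    #ifndef CONFIG_PPC_KUAP
    static inline void setup_kuap(bool disabled) { }

    static inline void allow_user_access(void __user *to, const void __user *from,
                                         unsigned long size) { }
    static inline void prevent_user_access(void __user *to, const void __user *from,
                                           unsigned long size) { }
    #endif

    /* Convenience wrappers used by the uaccess routines: */
    static inline void allow_read_from_user(const void __user *from, unsigned long size)
    {
        allow_user_access(NULL, from, size);
    }

    static inline void allow_write_to_user(void __user *to, unsigned long size)
    {
        allow_user_access(to, NULL, size);
    }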
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a skeleton for Kernel Userspace Protection
functionalities such as Kernel Userspace Access Protection and Kernel
Userspace Execution Prevention.
The subsequent implementation of KUAP for radix makes use of an MMU
feature in order to patch out assembly when KUAP is disabled or
unsupported. This won't work unless there's an entry point for KUP
support before the feature magic happens, so for PPC64 setup_kup() is
called early in setup.
On PPC32, feature_fixup() is done too early to allow the same.
Suggested-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to implement KUAP (Kernel Userspace Access Protection) on
Power9 we will be using the AMR, and therefore indirectly the
UAMOR/AMOR.
So save/restore these regs in the idle code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Without restoring the IAMR after idle, execution prevention on POWER9
with Radix MMU is overwritten and the kernel can freely execute
userspace without faulting.
This is necessary when returning from any stop state that modifies
user state, as well as hypervisor state.
To test how this fails without this patch, load the lkdtm driver and
do the following:
$ echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
which won't fault, then boot the kernel with powersave=off, where it
will fault. Applying this patch will fix this.
Fixes: 3b10d0095a ("powerpc/mm/radix: Prevent kernel execution of user space")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a flag so that the DAWR can be enabled on P9 via:
echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
The DAWR was previously force disabled on POWER9 in:
9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt
This is a dangerous setting, USE AT YOUR OWN RISK.
Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR. This
allows them to force enable DAWR.
This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine is once again safe from crashing.
Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.
For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.
To double check the DAWR is working, run this kernel selftest:
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add support to hwpoison pages upon hitting a machine check
exception.
This patch queues the address where the UE is hit to a percpu array
and schedules work to plumb it into the memory poison infrastructure.
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Combine #ifdefs, drop PPC_BIT8(), and empty inline stub]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With STRICT_KERNEL_RWX enabled anything marked __init is placed at a 16M
boundary. This is necessary so that it can be repurposed later with
different permissions. However, in kernels with text larger than 16M,
this pushes early_setup past 32M, out of reach of the branch
instruction.
Fix this by setting the CTR and branching there instead.
Fixes: 1e0fc9d1eb ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
Signed-off-by: Russell Currey <ruscur@russell.cc>
[mpe: Fix it to work on BE by using DOTSYM()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
broke the radix-mode segment exception handler. In radix mode, this
exception is not an SLB miss, rather it signals that the EA is outside
the range translated by any page table.
The commit lost the radix feature alternate code patch, which can
cause faults on some EAs to hit the kernel BUG at
arch/powerpc/mm/slb.c:639!
The original radix code would send faults to slb_miss_large_addr,
which would end up faulting due to slb_addr_limit being 0. This patch
sends radix directly to do_bad_slb_fault, which is a bit clearer.
Fixes: 48e7b76957 ("powerpc/64s/hash: Convert SLB miss handlers to C")
Cc: stable@vger.kernel.org # v4.20+
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC
inconsistencies across Y2038") changed the type of wtom_clock_sec
to s64 on PPC64. Therefore, VDSO32 needs to read it with a 4-byte
shift in order to retrieve the lower part of it.
Fixes: b5b4453e79 ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing
stack pointer while in RTAS") changed the code to use a field in the
thread struct to store the stack pointer while in RTAS, instead of
using SPRN_SPRG2. It therefore converted all places which were
manipulating SPRN_SPRG2 to use that field. During early startup, the
zeroing of SPRN_SPRG2 has been replaced by a zeroing of that field in
the thread struct. But at least in start_here, that's done wrongly,
because it uses the physical address of the field while the MMU is on
at that time.
So the virtual address of the field should be used instead; but in
the meantime, the thread struct has already been zeroed and
initialised, so we can just drop this initialisation.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Fixes: 0df977eafc ("powerpc/6xx: Don't use SPRN_SPRG2 for storing stack pointer while in RTAS")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When I updated the spectre_v2 reporting to handle software count cache
flush I got the logic wrong when there's no software count cache
enabled at all.
The result is that on systems with the software count cache flush
disabled we print:
Mitigation: Indirect branch cache disabled, Software count cache flush
Which correctly indicates that the count cache is disabled, but
incorrectly says the software count cache flush is enabled.
The root of the problem is that we are trying to handle all
combinations of options. But we know now that we only expect to see
the software count cache flush enabled if the other options are false.
So split the two cases, which simplifies the logic and fixes the bug.
We were also missing a space before "(hardware accelerated)".
The result is we see one of:
Mitigation: Indirect branch serialisation (kernel only)
Mitigation: Indirect branch cache disabled
Mitigation: Software count cache flush
Mitigation: Software count cache flush (hardware accelerated)
Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at
startup. This patch moves it from __setup_cpu_603() to start_here()
and __secondary_start(), close to the initialisation of SPRN_THREAD.
Previously, the virtual address of the PGDIR was retrieved from the
thread struct. Now that it is the physical address which is stored in
SPRN_SPRG_PGDIR, hash_page() must not convert it to a physical address
anymore. This patch removes the conversion.
Fixes: 93c4a162b0 ("powerpc/6xx: Store PGDIR physical address in a SPRG")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jakub Drnec reported:
Setting the realtime clock can sometimes make the monotonic clock go
back by over a hundred years. Decreasing the realtime clock across
the y2k38 threshold is one reliable way to reproduce. Allegedly this
can also happen just by running ntpd, I have not managed to
reproduce that other than booting with rtc at >2038 and then running
ntp. When this happens, anything with timers (e.g. openjdk) breaks
rather badly.
And included a test case (slightly edited for brevity):
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    #include <unistd.h>

    long get_time(void) {
        struct timespec tp;
        clock_gettime(CLOCK_MONOTONIC, &tp);
        return tp.tv_sec + tp.tv_nsec / 1000000000;
    }

    int main(void) {
        long last = get_time();
        while(1) {
            long now = get_time();
            if (now < last) {
                printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
        }
        return 0;
    }
Which when run concurrently with:
# date -s 2040-1-1
# date -s 2037-1-1
Will detect the clock going backward.
The root cause is that wtom_clock_sec in struct vdso_data is only a
32-bit signed value, even though we set its value to be equal to
tk->wall_to_monotonic.tv_sec which is 64-bits.
Because the monotonic clock starts at zero when the system boots, the
wall_to_monotonic.tv_sec offset is negative for current and future
dates. Currently on a freshly booted system the offset will be in the
vicinity of negative 1.5 billion seconds.
However if the wall clock is set past the Y2038 boundary, the offset
from wall to monotonic becomes less than negative 2^31, and no longer
fits in 32-bits. When that value is assigned to wtom_clock_sec it is
truncated and becomes positive, causing the VDSO assembly code to
calculate CLOCK_MONOTONIC incorrectly.
That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
it is not meant to do. Worse, if the time is then set back before the
Y2038 boundary CLOCK_MONOTONIC will jump backward.
We can fix it simply by storing the full 64-bit offset in the
vdso_data, and using that in the VDSO assembly code. We also shuffle
some of the fields in vdso_data to avoid creating a hole.
The original commit that added the CLOCK_MONOTONIC support to the VDSO
did actually use a 64-bit value for wtom_clock_sec, see commit
a7f290dad3 ("[PATCH] powerpc: Merge vdso's and add vdso support to
32 bits kernel") (Nov 2005). However just 3 days later it was
converted to 32-bits in commit 0c37ec2aa8 ("[PATCH] powerpc: vdso
fixes (take #2)"), and the bug has existed since then AFAICS.
Fixes: 0c37ec2aa8 ("[PATCH] powerpc: vdso fixes (take #2)")
Cc: stable@vger.kernel.org # v2.6.15+
Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
Reported-by: Jakub Drnec <jaydee@email.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix to prevent runtime allocation of 16GB pages when running in a
VM (as opposed to bare metal), because it doesn't work.
A small fix to our recently added KCOV support to exempt some more
code from being instrumented.
Plus a few minor build fixes, a small dead code removal and a
defconfig update.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Christophe Leroy,
Jason Yan, Joel Stanley, Mahesh Salgaonkar, Mathieu Malaterre"
* tag 'powerpc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Include <asm/nmi.h> header file to fix a warning
powerpc/powernv: Fix compile without CONFIG_TRACEPOINTS
powerpc/mm: Disable kcov for SLB routines
powerpc: remove dead code in head_fsl_booke.S
powerpc/configs: Sync skiroot defconfig
powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration