2019-06-03 13:44:50 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2012-03-05 19:49:28 +08:00
|
|
|
/*
 * Based on arch/arm/kernel/process.c
 *
 * Original Copyright (C) 1995 Linus Torvalds
 * Copyright (C) 1996-2000 Russell King - Converted to ARM.
 * Copyright (C) 2012 ARM Ltd.
 */
|
2014-04-30 17:51:32 +08:00
|
|
|
#include <linux/compat.h>
|
2015-03-06 22:49:24 +08:00
|
|
|
#include <linux/efi.h>
|
2020-03-17 00:50:47 +08:00
|
|
|
#include <linux/elf.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <linux/export.h>
|
|
|
|
#include <linux/sched.h>
|
2017-02-09 01:51:35 +08:00
|
|
|
#include <linux/sched/debug.h>
|
2017-02-09 01:51:36 +08:00
|
|
|
#include <linux/sched/task.h>
|
2017-02-09 01:51:37 +08:00
|
|
|
#include <linux/sched/task_stack.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <linux/kernel.h>
|
2020-03-17 00:50:47 +08:00
|
|
|
#include <linux/mman.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <linux/mm.h>
|
2020-09-28 21:03:00 +08:00
|
|
|
#include <linux/nospec.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <linux/stddef.h>
|
2019-07-24 01:58:39 +08:00
|
|
|
#include <linux/sysctl.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <linux/unistd.h>
#include <linux/user.h>
#include <linux/delay.h>
#include <linux/reboot.h>
#include <linux/interrupt.h>
#include <linux/init.h>
#include <linux/cpu.h>
#include <linux/elfcore.h>
#include <linux/pm.h>
#include <linux/tick.h>
#include <linux/utsname.h>
#include <linux/uaccess.h>
#include <linux/random.h>
#include <linux/hw_breakpoint.h>
#include <linux/personality.h>
#include <linux/notifier.h>
|
2015-09-16 22:23:21 +08:00
|
|
|
#include <trace/events/power.h>
|
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
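The layout requirement described in this commit message can be illustrated with a small user-space model. This is only a sketch: `struct task_model`, `struct thread_info_model` and `task_thread_info_model()` are hypothetical stand-ins, not the kernel's definitions. It shows why keeping thread_info as the first member lets a task pointer be reinterpreted as a thread_info pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed stand-ins for the kernel structures. */
struct thread_info_model {
	unsigned long flags;
};

struct task_model {
	struct thread_info_model thread_info;	/* must stay the first member */
	int pid;
};

/* Compile-time check mirroring the offsetof(...) == 0 requirement. */
_Static_assert(offsetof(struct task_model, thread_info) == 0,
	       "thread_info must be the first member of the task");

/* With offset 0, a task pointer doubles as a thread_info pointer. */
static inline struct thread_info_model *
task_thread_info_model(struct task_model *tsk)
{
	return (struct thread_info_model *)tsk;
}
```

If a member were ever added ahead of `thread_info` in this model, the `_Static_assert` would reject the build rather than let the cast silently return the wrong address.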
2016-11-04 04:23:13 +08:00
|
|
|
#include <linux/percpu.h>
|
arm64/sve: Core task context handling
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.
Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since SVE
now dynamically may or may not need to be handled at these points
depending on the kernel configuration, hardware features discovered
at boot, and the runtime state of the task. To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE. Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.
When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary. Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate and free this buffer at appropriate
times.
TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected. This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly. If such a trap is taken without sufficient system-
wide hardware support, SIGILL is sent to the thread instead as if
an undefined instruction had been executed: this may happen if
userspace tries to use SVE in a system where not all CPUs support
it for example.
The kernel will clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace. For backwards
compatibility reasons and conformance with the spirit of the base
AArch64 procedure call standard, the subset of the SVE register
state that aliases the FPSIMD registers is still preserved across a
syscall even if this happens. The remainder of the SVE register
state logically becomes zero at syscall entry, though the actual
zeroing work is currently deferred until the thread next tries to
use SVE, causing another trap to the kernel. This implementation
is suboptimal: in the future, the fastpath case may be optimised
to zero the registers in-place and leave SVE enabled for the task,
where beneficial.
TIF_SVE is also cleared in the following slowpath cases, which are
taken as reasonable hints that the task may no longer use SVE:
* exec
* fork and clone
Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: added #include to fix allnoconfig build]
[will: use enable_daif in do_sve_acc]
Signed-off-by: Will Deacon <will.deacon@arm.com>
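The TIF_SVE life cycle described in this commit message can be approximated by a toy model. Everything here is an assumption made for illustration: the `model_*` names, the fixed 256-bit vector length, and the omission of predicate registers and FFR. It is not the kernel's implementation. It shows the FPSIMD-aliased low bytes surviving a syscall while the remaining bytes read as zero once SVE is re-enabled by the next trap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NR_VREGS   32
#define MODEL_FPSIMD_LEN 16	/* bytes of each Z register aliased by V */
#define MODEL_SVE_LEN    32	/* assumed vector length in bytes */

struct sve_task_model {
	bool tif_sve;					/* models TIF_SVE */
	uint8_t fpsimd[MODEL_NR_VREGS][MODEL_FPSIMD_LEN];
	uint8_t sve[MODEL_NR_VREGS][MODEL_SVE_LEN];	/* models sve_state */
};

/* Syscall entry: drop TIF_SVE, keeping only the FPSIMD-aliased bytes. */
static inline void model_syscall_entry(struct sve_task_model *t)
{
	if (!t->tif_sve)
		return;
	for (int i = 0; i < MODEL_NR_VREGS; i++)
		memcpy(t->fpsimd[i], t->sve[i], MODEL_FPSIMD_LEN);
	t->tif_sve = false;
}

/* SVE access trap: rebuild the SVE view from FPSIMD, zeroing the rest. */
static inline void model_sve_access_trap(struct sve_task_model *t)
{
	for (int i = 0; i < MODEL_NR_VREGS; i++) {
		memset(t->sve[i], 0, MODEL_SVE_LEN);
		memcpy(t->sve[i], t->fpsimd[i], MODEL_FPSIMD_LEN);
	}
	t->tif_sve = true;
}
```

The zeroing happens in the trap handler rather than at syscall entry, which matches the deferred approach the message describes as suboptimal but simple.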
2017-10-31 23:51:05 +08:00
|
|
|
#include <linux/thread_info.h>
|
2019-07-24 01:58:39 +08:00
|
|
|
#include <linux/prctl.h>
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will skip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will allow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
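The callback-driven walk described in this commit message can be sketched in plain C. This is a model under stated assumptions, not the kernel code: `stack_walk_model()` stands in for arch_stack_walk() and iterates over an array of saved PCs, the scheduler text range is an invented constant, and the 16-frame cap is an assumed limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback type: return false to stop the walk. */
typedef bool (*walk_fn_model)(void *cookie, unsigned long pc);

/* Hypothetical stand-in for arch_stack_walk(): feeds saved PCs to fn. */
static inline void stack_walk_model(walk_fn_model fn, void *cookie,
				    const unsigned long *pcs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!fn(cookie, pcs[i]))
			break;
}

/* Assumption: addresses in [START, END) belong to scheduler functions. */
#define MODEL_SCHED_START 0x1000UL
#define MODEL_SCHED_END   0x2000UL
#define MODEL_MAX_FRAMES  16	/* assumed cap on frames examined */

struct wchan_info_model {
	unsigned long pc;	/* result: 0 if nothing suitable was found */
	int count;
};

static inline bool wchan_cb_model(void *cookie, unsigned long pc)
{
	struct wchan_info_model *info = cookie;

	if (pc < MODEL_SCHED_START || pc >= MODEL_SCHED_END) {
		info->pc = pc;		/* first non-scheduler caller */
		return false;
	}
	return ++info->count < MODEL_MAX_FRAMES;
}

static inline unsigned long get_wchan_model(const unsigned long *pcs, size_t n)
{
	struct wchan_info_model info = { .pc = 0, .count = 0 };

	stack_walk_model(wchan_cb_model, &info, pcs, n);
	return info.pc;
}
```

Because every scheduler frame is skipped by the same range test, no frame needs special-case handling, which is the simplification the message attributes to marking __switch_to() as __sched.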
2021-11-29 22:28:45 +08:00
|
|
|
#include <linux/stacktrace.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2016-02-05 22:58:48 +08:00
|
|
|
#include <asm/alternative.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <asm/compat.h>
|
arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
Preempting from IRQ-return means that the task has its PSTATE saved
on the stack, which will get restored when the task is resumed and does
the actual IRQ return.
However, enabling some CPU features requires modifying the PSTATE. This
means that, if a task was scheduled out during an IRQ-return before all
CPU features are enabled, the task might restore a PSTATE that does not
include the feature enablement changes once scheduled back in.
* Task 1:
PAN == 0 ---| |---------------
| |<- return from IRQ, PSTATE.PAN = 0
| <- IRQ |
+--------+ <- preempt() +--
^
|
reschedule Task 1, PSTATE.PAN == 1
* Init:
--------------------+------------------------
^
|
enable_cpu_features
set PSTATE.PAN on all CPUs
Worse than this, since PSTATE is untouched when task switching is done,
a task missing the new bits in PSTATE might affect another task, if both
do direct calls to schedule() (outside of IRQ/exception contexts).
Fix this by preventing preemption on IRQ-return until features are
enabled on all CPUs.
This way the only PSTATE values that are saved on the stack are from
synchronous exceptions. These are expected to be fatal this early; the
exception is BRK for WARN_ON(), but as this uses do_debug_exception(),
which keeps IRQs masked, it shouldn't call schedule().
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[james: Replaced a really cool hack, with an even simpler static key in C.
expanded commit message with Julien's cover-letter ascii art]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
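The fix described in this commit message amounts to gating IRQ-return preemption on a one-way flag. The sketch below models that idea with an ordinary boolean; the kernel uses a static key, and the names `model_features_finalized` and `model_may_preempt_on_irq_return()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Models a flag flipped once every CPU has its features enabled. */
static bool model_features_finalized;

/* Models the IRQ-return decision: reschedule only when it is safe. */
static inline bool model_may_preempt_on_irq_return(bool need_resched)
{
	if (!model_features_finalized)
		return false;	/* a stale PSTATE could be saved on the stack */
	return need_resched;
}
```

Until the flag is set, a pending reschedule is simply left for a later point, so no task can be switched out with a PSTATE that predates feature enablement.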
2019-10-16 01:25:44 +08:00
|
|
|
#include <asm/cpufeature.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <asm/cacheflush.h>
|
2016-10-18 18:27:48 +08:00
|
|
|
#include <asm/exec.h>
|
2013-01-17 20:31:45 +08:00
|
|
|
#include <asm/fpsimd.h>
|
|
|
|
#include <asm/mmu_context.h>
|
2019-09-16 18:51:17 +08:00
|
|
|
#include <asm/mte.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <asm/processor.h>
|
arm64: add basic pointer authentication support
This patch adds basic support for pointer authentication, allowing
userspace to make use of APIAKey, APIBKey, APDAKey, APDBKey, and
APGAKey. The kernel maintains key values for each process (shared by all
threads within), which are initialised to random values at exec() time.
The ID_AA64ISAR1_EL1.{APA,API,GPA,GPI} fields are exposed to userspace,
to describe that pointer authentication instructions are available and
that the kernel is managing the keys. Two new hwcaps are added for the
same reason: PACA (for address authentication) and PACG (for generic
authentication).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Tested-by: Adam Wallis <awallis@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
[will: Fix sizeof() usage and unroll address key initialisation]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-12-08 02:39:25 +08:00
|
|
|
#include <asm/pointer_auth.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
#include <asm/stacktrace.h>
|
2021-03-24 14:54:58 +08:00
|
|
|
#include <asm/switch_to.h>
|
|
|
|
#include <asm/system_misc.h>
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2018-12-12 20:08:44 +08:00
|
|
|
#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
|
2014-06-26 06:55:03 +08:00
|
|
|
#include <linux/stackprotector.h>
|
2021-09-14 17:44:02 +08:00
|
|
|
unsigned long __stack_chk_guard __ro_after_init;
|
2014-06-26 06:55:03 +08:00
|
|
|
EXPORT_SYMBOL(__stack_chk_guard);
|
|
|
|
#endif
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
/*
 * Function pointers to optional machine specific functions
 */
void (*pm_power_off)(void);
EXPORT_SYMBOL_GPL(pm_power_off);
|
|
|
|
|
2013-10-25 03:30:18 +08:00
|
|
|
#ifdef CONFIG_HOTPLUG_CPU
void arch_cpu_idle_dead(void)
{
	cpu_die();
}
#endif
|
|
|
|
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUs, which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 09:41:22 +08:00
|
|
|
/*
 * Called by kexec, immediately prior to machine_kexec().
 *
 * This must completely disable all secondary CPUs; simply causing those CPUs
 * to execute e.g. a RAM-based pin loop is not sufficient. This allows the
 * kexec'd kernel to use any and all RAM as it sees fit, without having to
 * avoid any code or data used by any SW CPU pin loop. The CPU hotplug
|
2020-03-23 21:50:59 +08:00
|
|
|
 * functionality embodied in smp_shutdown_nonboot_cpus() is used to achieve this.
|
2014-05-07 09:41:22 +08:00
|
|
|
*/
|
2012-03-05 19:49:28 +08:00
|
|
|
void machine_shutdown(void)
|
|
|
|
{
|
2020-03-23 21:51:00 +08:00
|
|
|
smp_shutdown_nonboot_cpus(reboot_cpu);
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
|
2014-05-07 09:41:22 +08:00
|
|
|
/*
 * Halting simply requires that the secondary CPUs stop performing any
 * activity (executing tasks, handling interrupts). smp_send_stop()
 * achieves this.
 */
|
2012-03-05 19:49:28 +08:00
|
|
|
void machine_halt(void)
|
|
|
|
{
|
2014-05-07 09:41:23 +08:00
|
|
|
local_irq_disable();
|
2014-05-07 09:41:22 +08:00
|
|
|
smp_send_stop();
|
2012-03-05 19:49:28 +08:00
|
|
|
while (1);
|
|
|
|
}
|
|
|
|
|
2014-05-07 09:41:22 +08:00
|
|
|
/*
 * Power-off simply requires that the secondary CPUs stop performing any
 * activity (executing tasks, handling interrupts). smp_send_stop()
 * achieves this. When the system power is turned off, it will take all CPUs
 * with it.
 */
|
2012-03-05 19:49:28 +08:00
|
|
|
void machine_power_off(void)
|
|
|
|
{
|
2014-05-07 09:41:23 +08:00
|
|
|
local_irq_disable();
|
2014-05-07 09:41:22 +08:00
|
|
|
smp_send_stop();
|
2022-05-10 07:32:20 +08:00
|
|
|
do_kernel_power_off();
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
|
2014-05-07 09:41:22 +08:00
|
|
|
/*
|
|
|
|
* Restart requires that the secondary CPUs stop performing any activity
|
2015-04-20 17:24:35 +08:00
|
|
|
* while the primary CPU resets the system. Systems with multiple CPUs must
|
2014-05-07 09:41:22 +08:00
|
|
|
 * provide a HW restart implementation, to ensure that all CPUs reset at once.
 * This is required so that any code running after reset on the primary CPU
 * doesn't have to co-ordinate with other CPUs to ensure they aren't still
 * executing pre-reset code, and using RAM that the primary CPU's code wishes
 * to use. Implementing such co-ordination would be essentially impossible.
 */
|
2012-03-05 19:49:28 +08:00
|
|
|
void machine_restart(char *cmd)
{
	/* Disable interrupts first */
	local_irq_disable();
|
2014-05-07 09:41:23 +08:00
|
|
|
smp_send_stop();
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2015-03-06 22:49:24 +08:00
|
|
|
	/*
	 * UpdateCapsule() depends on the system being reset via
	 * ResetSystem().
	 */
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_reboot(reboot_mode, NULL);
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
/* Now call the architecture specific reboot code. */
|
2021-06-04 22:07:36 +08:00
|
|
|
do_kernel_restart(cmd);
|
2012-03-05 19:49:28 +08:00
|
|
|
|
|
|
|
	/*
	 * Whoops - the architecture was unable to reboot.
	 */
	printk("Reboot failed -- System halted\n");
	while (1);
}
|
|
|
|
|
arm64: BTI: Decode BYTPE bits when printing PSTATE
The current code to print PSTATE symbolically when generating
backtraces etc. does not include the BTYPE field used by Branch
Target Identification.
So, decode BTYPE and print it too.
In the interests of human-readability, print the classes of BTI
matched. The symbolic notation, BTYPE (PSTATE[11:10]) and
permitted classes of subsequent instruction are:
-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BTYPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
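The mapping listed in this commit message can be exercised outside the kernel with a small lookup table. This is a sketch only: the `MODEL_*` macros and `model_btype_str()` are hypothetical names, with the field position PSTATE[11:10] taken from the message itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* BTYPE occupies PSTATE[11:10], per the commit message. */
#define MODEL_BTYPE_SHIFT 10
#define MODEL_BTYPE_MASK  (UINT64_C(3) << MODEL_BTYPE_SHIFT)

/* Indexed by the 2-bit BTYPE value. */
static const char *const model_btypes[] = {
	[0] = "--",	/* 0b00: any instruction */
	[1] = "jc",	/* 0b01 */
	[2] = "-c",	/* 0b10 */
	[3] = "j-",	/* 0b11 */
};

static inline const char *model_btype_str(uint64_t pstate)
{
	return model_btypes[(pstate & MODEL_BTYPE_MASK) >> MODEL_BTYPE_SHIFT];
}
```

Masking before shifting keeps the index within 0..3, so the table lookup cannot run out of bounds whatever the other PSTATE bits contain.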
2020-03-17 00:50:48 +08:00
|
|
|
#define bstr(suffix, str) [PSR_BTYPE_ ## suffix >> PSR_BTYPE_SHIFT] = str
static const char *const btypes[] = {
	bstr(NONE, "--"),
	bstr(  JC, "jc"),
	bstr(   C, "-c"),
	bstr(  J , "j-")
};
#undef bstr
|
|
|
|
|
2017-10-19 20:26:26 +08:00
|
|
|
static void print_pstate(struct pt_regs *regs)
{
	u64 pstate = regs->pstate;

	if (compat_user_mode(regs)) {
|
2021-07-22 10:20:36 +08:00
|
|
|
printk("pstate: %08llx (%c%c%c%c %c %s %s %c%c%c %cDIT %cSSBS)\n",
|
2017-10-19 20:26:26 +08:00
|
|
|
pstate,
|
2018-07-05 22:16:52 +08:00
|
|
|
			pstate & PSR_AA32_N_BIT ? 'N' : 'n',
			pstate & PSR_AA32_Z_BIT ? 'Z' : 'z',
			pstate & PSR_AA32_C_BIT ? 'C' : 'c',
			pstate & PSR_AA32_V_BIT ? 'V' : 'v',
			pstate & PSR_AA32_Q_BIT ? 'Q' : 'q',
			pstate & PSR_AA32_T_BIT ? "T32" : "A32",
			pstate & PSR_AA32_E_BIT ? "BE" : "LE",
			pstate & PSR_AA32_A_BIT ? 'A' : 'a',
			pstate & PSR_AA32_I_BIT ? 'I' : 'i',
|
2021-07-22 10:20:36 +08:00
|
|
|
pstate & PSR_AA32_F_BIT ? 'F' : 'f',
|
|
|
|
pstate & PSR_AA32_DIT_BIT ? '+' : '-',
|
|
|
|
pstate & PSR_AA32_SSBS_BIT ? '+' : '-');
|
2017-10-19 20:26:26 +08:00
|
|
|
} else {
|
2020-03-17 00:50:48 +08:00
|
|
|
const char *btype_str = btypes[(pstate & PSR_BTYPE_MASK) >>
|
|
|
|
PSR_BTYPE_SHIFT];
|
|
|
|
|
2021-07-22 10:20:36 +08:00
|
|
|
printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO %cTCO %cDIT %cSSBS BTYPE=%s)\n",
|
2017-10-19 20:26:26 +08:00
|
|
|
			pstate,
			pstate & PSR_N_BIT ? 'N' : 'n',
			pstate & PSR_Z_BIT ? 'Z' : 'z',
			pstate & PSR_C_BIT ? 'C' : 'c',
			pstate & PSR_V_BIT ? 'V' : 'v',
			pstate & PSR_D_BIT ? 'D' : 'd',
			pstate & PSR_A_BIT ? 'A' : 'a',
			pstate & PSR_I_BIT ? 'I' : 'i',
			pstate & PSR_F_BIT ? 'F' : 'f',
			pstate & PSR_PAN_BIT ? '+' : '-',
|
2020-03-17 00:50:48 +08:00
|
|
|
pstate & PSR_UAO_BIT ? '+' : '-',
|
2019-09-16 18:51:17 +08:00
|
|
|
pstate & PSR_TCO_BIT ? '+' : '-',
|
2021-07-22 10:20:36 +08:00
|
|
|
pstate & PSR_DIT_BIT ? '+' : '-',
|
|
|
|
pstate & PSR_SSBS_BIT ? '+' : '-',
|
2020-03-17 00:50:48 +08:00
|
|
|
btype_str);
|
2017-10-19 20:26:26 +08:00
|
|
|
}
|
|
|
|
}
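The upper/lower-case convention that print_pstate() uses for the condition flags can be tried in isolation. The following is a minimal user-space sketch, assuming only the architectural positions of N, Z, C and V in PSTATE[31:28]; the `MODEL_PSR_*` macros and `model_nzcv_str()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AArch64 condition flags live in PSTATE[31:28]. */
#define MODEL_PSR_N (UINT64_C(1) << 31)
#define MODEL_PSR_Z (UINT64_C(1) << 30)
#define MODEL_PSR_C (UINT64_C(1) << 29)
#define MODEL_PSR_V (UINT64_C(1) << 28)

/* Fills out with four flag letters: upper case means the flag is set. */
static inline void model_nzcv_str(uint64_t pstate, char out[5])
{
	out[0] = (pstate & MODEL_PSR_N) ? 'N' : 'n';
	out[1] = (pstate & MODEL_PSR_Z) ? 'Z' : 'z';
	out[2] = (pstate & MODEL_PSR_C) ? 'C' : 'c';
	out[3] = (pstate & MODEL_PSR_V) ? 'V' : 'v';
	out[4] = '\0';
}
```

Printing every flag in a fixed position, with case carrying the value, keeps register dumps aligned and easy to compare across oopses.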
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
void __show_regs(struct pt_regs *regs)
|
|
|
|
{
|
2013-09-18 01:49:46 +08:00
|
|
|
	int i, top_reg;
	u64 lr, sp;

	if (compat_user_mode(regs)) {
		lr = regs->compat_lr;
		sp = regs->compat_sp;
		top_reg = 12;
	} else {
		lr = regs->regs[30];
		sp = regs->sp;
		top_reg = 29;
	}
|
2012-03-05 19:49:28 +08:00
|
|
|
|
dump_stack: unify debug information printed by show_regs()
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Archs which didn't print debug info now do.
alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
um, xtensa
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping an s390-specific implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
An example BUG() dump follows.
kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>] [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8 EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
[<ffffffff81000312>] do_one_initcall+0x122/0x170
[<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
[<ffffffff81c47760>] ? rest_init+0x140/0x140
[<ffffffff81c4776e>] kernel_init+0xe/0xf0
[<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
[<ffffffff81c47760>] ? rest_init+0x140/0x140
...
v2: Typo fix in x86-32.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01 06:27:17 +08:00
|
|
|
show_regs_print_info(KERN_DEFAULT);
|
2017-10-19 20:26:26 +08:00
|
|
|
print_pstate(regs);
|
2018-02-20 00:46:57 +08:00
|
|
|
|
|
|
|
if (!user_mode(regs)) {
|
|
|
|
printk("pc : %pS\n", (void *)regs->pc);
|
2020-03-13 17:05:00 +08:00
|
|
|
printk("lr : %pS\n", (void *)ptrauth_strip_insn_pac(lr));
|
2018-02-20 00:46:57 +08:00
|
|
|
} else {
|
|
|
|
printk("pc : %016llx\n", regs->pc);
|
|
|
|
printk("lr : %016llx\n", lr);
|
|
|
|
}
|
|
|
|
|
2017-10-19 20:26:26 +08:00
|
|
|
printk("sp : %016llx\n", sp);
|
arm64: fix show_regs fallout from KERN_CONT changes
Recently in commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for
printing continuation lines"), the behaviour of printk changed w.r.t.
KERN_CONT. Now, KERN_CONT is mandatory to continue existing lines.
Without this, prefixes are inserted, making output illegible, e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0 [ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8 [ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001 [ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
... or when dumped with the userspace dmesg tool, which has slightly
different implicit newline behaviour. e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0
[ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8
[ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001
[ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
We can't simply always use KERN_CONT for lines which may or may not be
continuations. That causes line prefixes (e.g. timestamps) to be
suppressed, and the alignment of all but the first line will be broken.
For even more fun, we can't simply insert some dummy empty-string printk
calls, as GCC warns for an empty printk string, and even if we pass
KERN_DEFAULT explicitly to silence the warning, the prefix gets swallowed
unless there is an additional part to the string.
Instead, we must manually iterate over pairs of registers, which gives
us the legible output we want in either case, e.g.
[ 169.771790] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 169.779109] sp : ffff000008d53ec0
[ 169.782386] x29: ffff000008d53ec0 x28: 0000000080c50018
[ 169.787650] x27: ffff000008e0c7f8 x26: ffff80097631de00
[ 169.792913] x25: 0000000000000001 x24: 00000027827b2cf4
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
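The manual iteration described in this commit message can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's code: toy_append and toy_format_regs are hypothetical helpers that write into a buffer instead of calling printk() and pr_cont(), and the grouping follows the loop as it appears in __show_regs() in this file, which puts up to three registers on a line rather than the pairs shown in the message.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Append formatted text at buf[off]; returns the new offset, capped at len. */
static size_t toy_append(char *buf, size_t len, size_t off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (off >= len)
		return len;
	va_start(ap, fmt);
	n = vsnprintf(buf + off, len - off, fmt, ap);
	va_end(ap);
	if (n < 0)
		return off;
	off += (size_t)n;
	return off > len ? len : off;
}

/*
 * Format regs[top_reg] down to regs[0], ending a line after each register
 * whose index is a multiple of three. Returns the number of lines emitted.
 */
static int toy_format_regs(char *buf, size_t len, const uint64_t *regs, int top_reg)
{
	size_t off = 0;
	int lines = 0;
	int i = top_reg;

	while (i >= 0) {
		/* First register on a line: a fresh printk() in the kernel. */
		off = toy_append(buf, len, off, "x%-2d: %016llx", i,
				 (unsigned long long)regs[i]);
		/* Remaining registers on the line: pr_cont() in the kernel. */
		while (i-- % 3)
			off = toy_append(buf, len, off, " x%-2d: %016llx", i,
					 (unsigned long long)regs[i]);
		off = toy_append(buf, len, off, "\n");
		lines++;
	}
	return lines;
}
```

The point of the structure is that only the first register of each line starts a new record, so line prefixes such as timestamps appear once per line and the columns stay aligned.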
2016-10-20 19:23:16 +08:00
|
|
|
|
2019-01-31 22:58:46 +08:00
|
|
|
if (system_uses_irq_prio_masking())
|
|
|
|
printk("pmr_save: %08llx\n", regs->pmr_save);
|
|
|
|
|
2016-10-20 19:23:16 +08:00
|
|
|
i = top_reg;
|
|
|
|
|
|
|
|
while (i >= 0) {
|
2021-04-21 01:22:45 +08:00
|
|
|
printk("x%-2d: %016llx", i, regs->regs[i]);
|
2016-10-20 19:23:16 +08:00
|
|
|
|
2021-04-21 01:22:45 +08:00
|
|
|
while (i-- % 3)
|
|
|
|
pr_cont(" x%-2d: %016llx", i, regs->regs[i]);
|
2016-10-20 19:23:16 +08:00
|
|
|
|
|
|
|
pr_cont("\n");
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-02-04 09:43:49 +08:00
|
|
|
void show_regs(struct pt_regs *regs)
|
2012-03-05 19:49:28 +08:00
|
|
|
{
|
|
|
|
__show_regs(regs);
|
2020-06-09 12:30:23 +08:00
|
|
|
dump_backtrace(regs, NULL, KERN_DEFAULT);
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
|
2014-09-11 21:38:16 +08:00
|
|
|
static void tls_thread_flush(void)
|
|
|
|
{
|
2016-09-08 20:55:38 +08:00
|
|
|
write_sysreg(0, tpidr_el0);
|
2022-04-19 19:22:20 +08:00
|
|
|
if (system_supports_tpidr2())
|
|
|
|
write_sysreg_s(0, SYS_TPIDR2_EL0);
|
2014-09-11 21:38:16 +08:00
|
|
|
|
|
|
|
if (is_compat_task()) {
|
2018-03-28 17:50:49 +08:00
|
|
|
current->thread.uw.tp_value = 0;
|
2014-09-11 21:38:16 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to ensure ordering between the shadow state and the
|
|
|
|
* hardware state, so that we don't corrupt the hardware state
|
|
|
|
* with a stale shadow state during context switch.
|
|
|
|
*/
|
|
|
|
barrier();
|
2016-09-08 20:55:38 +08:00
|
|
|
write_sysreg(0, tpidrro_el0);
|
2014-09-11 21:38:16 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-07-24 01:58:39 +08:00
|
|
|
static void flush_tagged_addr_state(void)
|
|
|
|
{
|
|
|
|
if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI))
|
|
|
|
clear_thread_flag(TIF_TAGGED_ADDR);
|
|
|
|
}
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
void flush_thread(void)
|
|
|
|
{
|
|
|
|
fpsimd_flush_thread();
|
2014-09-11 21:38:16 +08:00
|
|
|
tls_thread_flush();
|
2012-03-05 19:49:28 +08:00
|
|
|
flush_ptrace_hw_breakpoint(current);
|
2019-07-24 01:58:39 +08:00
|
|
|
flush_tagged_addr_state();
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
|
arm64/sve: Core task context handling
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.
Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since SVE
now dynamically may or may not need to be handled at these points
depending on the kernel configuration, hardware features discovered
at boot, and the runtime state of the task. To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE. Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.
When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary. Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate and free this buffer at appropriate
times.
TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected. This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly. If such a trap is taken without sufficient system-
wide hardware support, SIGILL is sent to the thread instead as if
an undefined instruction had been executed: this may happen if
userspace tries to use SVE in a system where not all CPUs support
it for example.
The kernel will clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace. For backwards
compatibility reasons and conformance with the spirit of the base
AArch64 procedure call standard, the subset of the SVE register
state that aliases the FPSIMD registers is still preserved across a
syscall even if this happens. The remainder of the SVE register
state logically becomes zero at syscall entry, though the actual
zeroing work is currently deferred until the thread next tries to
use SVE, causing another trap to the kernel. This implementation
is suboptimal: in the future, the fastpath case may be optimised
to zero the registers in-place and leave SVE enabled for the task,
where beneficial.
TIF_SVE is also cleared in the following slowpath cases, which are
taken as reasonable hints that the task may no longer use SVE:
* exec
* fork and clone
Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: added #include to fix allnoconfig build]
[will: use enable_daif in do_sve_acc]
Signed-off-by: Will Deacon <will.deacon@arm.com>
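The buffer lifetime described in this commit message (allocate on first use, detach on fork, free on release) can be modelled with a toy structure. This is a sketch under stated assumptions, not the kernel's implementation: struct toy_task and the toy_ functions are invented for illustration, with a bool standing in for TIF_SVE and calloc()/free() standing in for the kernel allocator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the per-task state; all names here are hypothetical. */
struct toy_task {
	bool uses_sve;   /* plays the role of TIF_SVE */
	void *sve_state; /* lazily allocated register buffer */
	size_t sve_size; /* buffer size implied by the vector length */
};

/* First SVE use: allocate a zeroed buffer if needed, then set the flag. */
static int toy_sve_enable(struct toy_task *t)
{
	if (!t->sve_state) {
		t->sve_state = calloc(1, t->sve_size);
		if (!t->sve_state)
			return -1;
	}
	t->uses_sve = true;
	return 0;
}

/* Duplicate a task: copy everything, then detach the parent's buffer. */
static void toy_dup_task(struct toy_task *dst, const struct toy_task *src)
{
	*dst = *src;
	dst->sve_state = NULL;
	dst->uses_sve = false;
}

/* Task teardown: free the buffer if one was ever allocated. */
static void toy_release_task(struct toy_task *t)
{
	free(t->sve_state);
	t->sve_state = NULL;
	t->uses_sve = false;
}
```

Clearing both the pointer and the flag in the copy keeps the two in a consistent state, so the child can neither use nor free the parent's buffer.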
2017-10-31 23:51:05 +08:00
|
|
|
void arch_release_task_struct(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
fpsimd_release_task(tsk);
|
|
|
|
}
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
|
|
|
|
{
|
2015-06-11 12:04:32 +08:00
|
|
|
if (current->mm)
|
|
|
|
fpsimd_preserve_current_state();
|
2012-03-05 19:49:28 +08:00
|
|
|
*dst = *src;
|
2017-10-31 23:51:05 +08:00
|
|
|
|
2019-10-01 04:56:00 +08:00
|
|
|
/* We rely on the above assignment to initialize dst's thread_flags: */
|
|
|
|
BUILD_BUG_ON(!IS_ENABLED(CONFIG_THREAD_INFO_IN_TASK));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Detach src's sve_state (if any) from dst so that it does not
|
2022-04-19 19:22:24 +08:00
|
|
|
* get erroneously used or freed prematurely. dst's copies
|
2019-10-01 04:56:00 +08:00
|
|
|
* will be allocated on demand later on if dst uses SVE.
|
|
|
|
* For consistency, also clear TIF_SVE here: this could be done
|
|
|
|
* later in copy_process(), but to avoid tripping up future
|
2022-04-19 19:22:24 +08:00
|
|
|
* maintainers it is best not to leave TIF flags and buffers in
|
2019-10-01 04:56:00 +08:00
|
|
|
* an inconsistent state, even temporarily.
|
|
|
|
*/
|
|
|
|
dst->thread.sve_state = NULL;
|
|
|
|
clear_tsk_thread_flag(dst, TIF_SVE);
|
|
|
|
|
2022-04-19 19:22:24 +08:00
|
|
|
/*
|
|
|
|
* In the unlikely event that we create a new thread with ZA
|
|
|
|
* enabled we should retain the ZA state so duplicate it here.
|
|
|
|
* This may be freed shortly if we exec() or if CLONE_SETTLS is set,
|
|
|
|
* but it's simpler to do it here. To avoid confusing the rest
|
|
|
|
* of the code ensure that we have a sve_state allocated
|
|
|
|
* whenever za_state is allocated.
|
|
|
|
*/
|
|
|
|
if (thread_za_enabled(&src->thread)) {
|
|
|
|
dst->thread.sve_state = kzalloc(sve_state_size(src),
|
|
|
|
GFP_KERNEL);
|
2022-04-26 19:30:53 +08:00
|
|
|
if (!dst->thread.sve_state)
|
2022-04-19 19:22:24 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
dst->thread.za_state = kmemdup(src->thread.za_state,
|
|
|
|
za_state_size(src),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!dst->thread.za_state) {
|
|
|
|
kfree(dst->thread.sve_state);
|
|
|
|
dst->thread.sve_state = NULL;
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
dst->thread.za_state = NULL;
|
|
|
|
clear_tsk_thread_flag(dst, TIF_SME);
|
|
|
|
}
|
2022-04-19 19:22:21 +08:00
|
|
|
|
2019-09-16 18:51:17 +08:00
|
|
|
/* clear any pending asynchronous tag fault raised by the parent */
|
|
|
|
clear_tsk_thread_flag(dst, TIF_MTE_ASYNC_FAULT);
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
asmlinkage void ret_from_fork(void) asm("ret_from_fork");
|
|
|
|
|
2022-04-09 07:07:50 +08:00
|
|
|
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
|
2012-03-05 19:49:28 +08:00
|
|
|
{
|
2022-04-09 07:07:50 +08:00
|
|
|
unsigned long clone_flags = args->flags;
|
|
|
|
unsigned long stack_start = args->stack;
|
|
|
|
unsigned long tls = args->tls;
|
2012-03-05 19:49:28 +08:00
|
|
|
struct pt_regs *childregs = task_pt_regs(p);
|
|
|
|
|
2012-10-05 19:31:20 +08:00
|
|
|
memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
|
2012-03-05 19:49:28 +08:00
|
|
|
|
arm64: fpsimd: Prevent registers leaking from dead tasks
Currently, loading of a task's fpsimd state into the CPU registers
is skipped if that task's state is already present in the registers
of that CPU.
However, the code relies on the struct fpsimd_state * (and by
extension struct task_struct *) to unambiguously identify a task.
There is a particular case in which this doesn't work reliably:
when a task exits, its task_struct may be recycled to describe a
new task.
Consider the following scenario:
1) Task P loads its fpsimd state onto cpu C.
per_cpu(fpsimd_last_state, C) := P;
P->thread.fpsimd_state.cpu := C;
2) Task X is scheduled onto C and loads its fpsimd state on C.
per_cpu(fpsimd_last_state, C) := X;
X->thread.fpsimd_state.cpu := C;
3) X exits, causing X's task_struct to be freed.
4) P forks a new child T, which obtains X's recycled task_struct.
T == X.
T->thread.fpsimd_state.cpu == C (inherited from P).
5) T is scheduled on C.
T's fpsimd state is not loaded, because
per_cpu(fpsimd_last_state, C) == T (== X) &&
T->thread.fpsimd_state.cpu == C.
(This is the check performed by fpsimd_thread_switch().)
So, T gets X's registers because the last registers loaded onto C
were those of X, in (2).
This patch fixes the problem by ensuring that the sched-in check
fails in (5): fpsimd_flush_task_state(T) is called when T is
forked, so that T->thread.fpsimd_state.cpu == C cannot be true.
This relies on the fact that T is not schedulable until after
copy_thread() completes.
Once T's fpsimd state has been loaded on some CPU C there may still
be other cpus D for which per_cpu(fpsimd_last_state, D) ==
&X->thread.fpsimd_state. But D is necessarily != C in this case,
and the check in (5) must fail.
An alternative fix would be to do refcounting on task_struct. This
would result in each CPU holding a reference to the last task whose
fpsimd state was loaded there. It's not clear whether this is
preferable, and it involves higher overhead than the fix proposed
in this patch. It would also move all the task_struct freeing
work into the context switch critical section, or otherwise some
deferred cleanup mechanism would need to be introduced, neither of
which seems obviously justified.
Cc: <stable@vger.kernel.org>
Fixes: 005f78cd8849 ("arm64: defer reloading a task's FPSIMD state to userland resume")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[will: word-smithed the comment so it makes more sense]
Signed-off-by: Will Deacon <will.deacon@arm.com>
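The scenario in steps (1) to (5) of this commit message can be replayed with a toy model. This is a sketch under stated assumptions, not the kernel's code: the toy_ names are invented for illustration, a plain array stands in for the per-CPU fpsimd_last_state variable, and the flush is modelled as storing an out-of-range CPU number so that the check can never match.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Toy task: only the field that matters for the sched-in check. */
struct toy_fp_task {
	unsigned int fpsimd_cpu; /* CPU that last loaded this task's state */
};

/* Stand-in for the per-CPU "last task loaded here" pointer. */
static struct toy_fp_task *toy_last_state[TOY_NR_CPUS];

/* Load a task's register state onto a CPU and record the binding. */
static void toy_load_state(struct toy_fp_task *t, unsigned int cpu)
{
	toy_last_state[cpu] = t;
	t->fpsimd_cpu = cpu;
}

/* The sched-in check: true means the reload is skipped. */
static bool toy_state_is_loaded(const struct toy_fp_task *t, unsigned int cpu)
{
	return toy_last_state[cpu] == t && t->fpsimd_cpu == cpu;
}

/* Disassociate a task from every CPU by storing an invalid CPU number. */
static void toy_flush_task_state(struct toy_fp_task *t)
{
	t->fpsimd_cpu = TOY_NR_CPUS;
}
```

To reproduce the bug, load P and then X on CPU 0, reuse X's storage for T by copying P into it, and observe that the check passes for T even though X's registers are the ones on the CPU. Calling toy_flush_task_state() on T at fork time makes the check fail, which forces the reload.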
2017-12-05 22:56:42 +08:00
|
|
|
/*
|
|
|
|
* In case p was allocated the same task_struct pointer as some
|
|
|
|
* other recently-exited task, make sure p is disassociated from
|
|
|
|
* any cpu that may have run that now-exited task recently.
|
|
|
|
* Otherwise we could erroneously skip reloading the FPSIMD
|
|
|
|
* registers for p.
|
|
|
|
*/
|
|
|
|
fpsimd_flush_task_state(p);
|
|
|
|
|
2020-03-13 17:04:56 +08:00
|
|
|
ptrauth_thread_init_kernel(p);
|
|
|
|
|
2022-04-12 23:18:48 +08:00
|
|
|
if (likely(!args->fn)) {
|
2012-10-22 03:56:52 +08:00
|
|
|
*childregs = *current_pt_regs();
|
2012-10-05 19:31:20 +08:00
|
|
|
childregs->regs[0] = 0;
|
2015-05-27 22:39:40 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the current TLS pointer from tpidr_el0 as it may be
|
|
|
|
* out-of-sync with the saved value.
|
|
|
|
*/
|
2016-09-08 20:55:38 +08:00
|
|
|
*task_user_tls(p) = read_sysreg(tpidr_el0);
|
2022-04-19 19:22:20 +08:00
|
|
|
if (system_supports_tpidr2())
|
|
|
|
p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
|
2015-05-27 22:39:40 +08:00
|
|
|
|
|
|
|
if (stack_start) {
|
|
|
|
if (is_compat_thread(task_thread_info(p)))
|
2012-10-18 12:55:54 +08:00
|
|
|
childregs->compat_sp = stack_start;
|
2015-05-27 22:39:40 +08:00
|
|
|
else
|
2012-10-18 12:55:54 +08:00
|
|
|
childregs->sp = stack_start;
|
2012-10-05 19:31:20 +08:00
|
|
|
}
|
2015-05-27 22:39:40 +08:00
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
/*
|
2020-01-03 01:24:08 +08:00
|
|
|
* If a TLS pointer was passed to clone, use it for the new
|
2022-04-19 19:22:20 +08:00
|
|
|
* thread. We also reset TPIDR2 if it's in use.
|
2012-03-05 19:49:28 +08:00
|
|
|
*/
|
2022-04-19 19:22:20 +08:00
|
|
|
if (clone_flags & CLONE_SETTLS) {
|
2020-01-03 01:24:08 +08:00
|
|
|
p->thread.uw.tp_value = tls;
|
2022-04-19 19:22:20 +08:00
|
|
|
p->thread.tpidr2_el0 = 0;
|
|
|
|
}
|
2012-10-05 19:31:20 +08:00
|
|
|
} else {
|
2020-11-13 20:49:21 +08:00
|
|
|
/*
|
|
|
|
* A kthread has no context to ERET to, so ensure any buggy
|
|
|
|
* ERET is treated as an illegal exception return.
|
|
|
|
*
|
|
|
|
* When a user task is created from a kthread, childregs will
|
|
|
|
* be initialized by start_thread() or start_compat_thread().
|
|
|
|
*/
|
2012-10-05 19:31:20 +08:00
|
|
|
memset(childregs, 0, sizeof(struct pt_regs));
|
2020-11-13 20:49:21 +08:00
|
|
|
childregs->pstate = PSR_MODE_EL1h | PSR_IL_BIT;
|
2019-01-31 22:58:46 +08:00
|
|
|
|
2022-04-12 23:18:48 +08:00
|
|
|
p->thread.cpu_context.x19 = (unsigned long)args->fn;
|
|
|
|
p->thread.cpu_context.x20 = (unsigned long)args->fn_arg;
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
|
2012-10-05 19:31:20 +08:00
|
|
|
p->thread.cpu_context.sp = (unsigned long)childregs;
|
arm64: Implement stack trace termination record
Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.
We'd like to use task_pt_regs(task)->stackframe as the final frame
record, as this is already setup upon exception entry from EL0. For
kernel tasks we need to consistently reserve the pt_regs and point x29
at this, which we can do with small changes to __primary_switched,
__secondary_switched, and copy_process().
Since the final frame record must be at a specific location, we must
create the final frame record in __primary_switched and
__secondary_switched rather than leaving this to start_kernel and
secondary_start_kernel. Thus, __primary_switched and
__secondary_switched will now show up in stacktraces for the idle tasks.
Since the final frame record is now identified by its location rather
than by its contents, we identify it at the start of unwind_frame(),
before we read any values from it.
External debuggers may terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel. That said, gdb
does not show the extra record probably because it uses DWARF and not
frame pointers for stack traces.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
[Mark: rebase, use ASM_BUG(), update comments, update commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210510110026.18061-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-05-10 19:00:26 +08:00
|
|
|
/*
|
|
|
|
* For the benefit of the unwinder, set up childregs->stackframe
|
|
|
|
* as the final frame for the new task.
|
|
|
|
*/
|
|
|
|
p->thread.cpu_context.fp = (unsigned long)childregs->stackframe;
|
2012-03-05 19:49:28 +08:00
|
|
|
|
|
|
|
ptrace_hw_copy_thread(p);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
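
The stack trace termination commit message above says that the final frame record is identified by its location rather than by its contents, and that it is recognised before any values are read from it. The following is a minimal user-space sketch of that idea, not the kernel's unwinder: `struct frame_record`, `unwind_reaches_final()` and the `max_frames` bound are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record: saved frame pointer plus return address. */
struct frame_record {
	struct frame_record *fp;
	uintptr_t pc;
};

/*
 * Walk the chain starting at 'start'. Return true only if the walk arrives
 * at 'final_rec', the record at the known location. The final record is
 * recognised by address before any of its fields are read. Return false if
 * the chain ends early or exceeds 'max_frames'.
 */
static bool unwind_reaches_final(const struct frame_record *start,
				 const struct frame_record *final_rec,
				 size_t max_frames, size_t *visited)
{
	const struct frame_record *rec = start;
	size_t n = 0;

	while (rec && n < max_frames) {
		if (rec == final_rec) {
			*visited = n;	/* frames seen before the final one */
			return true;
		}
		n++;
		rec = rec->fp;
	}

	*visited = n;
	return false;	/* terminated early: the trace is not reliable */
}
```

A trace that stops anywhere other than the known final record is reported as unreliable, which is the property the commit message is after.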
|
|
|
|
|
2017-06-21 23:00:44 +08:00
|
|
|
void tls_preserve_current_state(void)
|
|
|
|
{
|
|
|
|
*task_user_tls(current) = read_sysreg(tpidr_el0);
|
2022-04-19 19:22:20 +08:00
|
|
|
if (system_supports_tpidr2() && !is_compat_task())
|
|
|
|
current->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
|
2017-06-21 23:00:44 +08:00
|
|
|
}
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
static void tls_thread_switch(struct task_struct *next)
|
|
|
|
{
|
2017-06-21 23:00:44 +08:00
|
|
|
tls_preserve_current_state();
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2017-11-14 22:33:28 +08:00
|
|
|
if (is_compat_thread(task_thread_info(next)))
|
2018-03-28 17:50:49 +08:00
|
|
|
write_sysreg(next->thread.uw.tp_value, tpidrro_el0);
|
2017-11-14 22:33:28 +08:00
|
|
|
else if (!arm64_kernel_unmapped_at_el0())
|
|
|
|
write_sysreg(0, tpidrro_el0);
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2017-11-14 22:33:28 +08:00
|
|
|
write_sysreg(*task_user_tls(next), tpidr_el0);
|
2022-04-19 19:22:20 +08:00
|
|
|
if (system_supports_tpidr2())
|
|
|
|
write_sysreg_s(next->thread.tpidr2_el0, SYS_TPIDR2_EL0);
|
2012-03-05 19:49:28 +08:00
|
|
|
}
|
|
|
|
|
2019-07-22 21:53:09 +08:00
|
|
|
/*
|
|
|
|
* Force SSBS state on context-switch, since it may be lost after migrating
|
|
|
|
* from a CPU which treats the bit as RES0 in a heterogeneous system.
|
|
|
|
*/
|
|
|
|
static void ssbs_thread_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Nothing to do for kernel threads, but 'regs' may be junk
|
|
|
|
* (e.g. idle task) so check the flags and bail early.
|
|
|
|
*/
|
|
|
|
if (unlikely(next->flags & PF_KTHREAD))
|
|
|
|
return;
|
|
|
|
|
2020-02-06 18:42:58 +08:00
|
|
|
/*
|
|
|
|
* If all CPUs implement the SSBS extension, then we just need to
|
|
|
|
* context-switch the PSTATE field.
|
|
|
|
*/
|
2020-09-18 18:54:33 +08:00
|
|
|
if (cpus_have_const_cap(ARM64_SSBS))
|
2019-07-22 21:53:09 +08:00
|
|
|
return;
|
|
|
|
|
2020-09-18 18:54:33 +08:00
|
|
|
spectre_v4_enable_task_mitigation(next);
|
2019-07-22 21:53:09 +08:00
|
|
|
}
|
|
|
|
|
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-11-04 04:23:13 +08:00
|
|
|
/*
|
|
|
|
* We store our current task in sp_el0, which is clobbered by userspace. Keep a
|
|
|
|
* shadow copy so that we can restore this upon entry from userspace.
|
|
|
|
*
|
|
|
|
* This is *only* for exception entry from EL0, and is not valid until we
|
|
|
|
* __switch_to() a user task.
|
|
|
|
*/
|
|
|
|
DEFINE_PER_CPU(struct task_struct *, __entry_task);
|
|
|
|
|
|
|
|
static void entry_task_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
__this_cpu_write(__entry_task, next);
|
|
|
|
}
|
|
|
|
|
2020-08-01 01:38:23 +08:00
|
|
|
/*
|
|
|
|
* ARM erratum 1418040 handling, affecting the 32bit view of CNTVCT.
|
2021-12-21 07:41:14 +08:00
|
|
|
* Ensure access is disabled when switching to a 32bit task, ensure
|
|
|
|
* access is enabled when switching to a 64bit task.
|
2020-08-01 01:38:23 +08:00
|
|
|
*/
|
2021-12-21 07:41:14 +08:00
|
|
|
static void erratum_1418040_thread_switch(struct task_struct *next)
|
2020-08-01 01:38:23 +08:00
|
|
|
{
|
2021-12-21 07:41:14 +08:00
|
|
|
if (!IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) ||
|
|
|
|
!this_cpu_has_cap(ARM64_WORKAROUND_1418040))
|
2020-08-01 01:38:23 +08:00
|
|
|
return;
|
|
|
|
|
2021-12-21 07:41:14 +08:00
|
|
|
if (is_compat_thread(task_thread_info(next)))
|
|
|
|
sysreg_clear_set(cntkctl_el1, ARCH_TIMER_USR_VCT_ACCESS_EN, 0);
|
2020-08-01 01:38:23 +08:00
|
|
|
else
|
2021-12-21 07:41:14 +08:00
|
|
|
sysreg_clear_set(cntkctl_el1, 0, ARCH_TIMER_USR_VCT_ACCESS_EN);
|
|
|
|
}
|
2020-08-01 01:38:23 +08:00
|
|
|
|
2021-12-21 07:41:14 +08:00
|
|
|
static void erratum_1418040_new_exec(void)
|
|
|
|
{
|
|
|
|
preempt_disable();
|
|
|
|
erratum_1418040_thread_switch(current);
|
|
|
|
preempt_enable();
|
2020-08-01 01:38:23 +08:00
|
|
|
}
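
The erratum 1418040 handling above toggles one enable bit depending on whether the next task is 32-bit or 64-bit. Below is a minimal user-space sketch of that clear-then-set pattern, assuming a plain variable in place of `cntkctl_el1`. `DEMO_VCT_ACCESS_EN`, `demo_reg`, `demo_writes`, `demo_clear_set()` and `demo_switch_to()` are hypothetical names, and the bit position is illustrative only.

```c
#include <stdint.h>

#define DEMO_VCT_ACCESS_EN (UINT64_C(1) << 1)	/* illustrative bit position */

static uint64_t demo_reg;		/* stands in for the system register */
static unsigned int demo_writes;	/* counts simulated register writes */

/* Clear the 'clear' bits, then set the 'set' bits; write only on change. */
static void demo_clear_set(uint64_t clear, uint64_t set)
{
	uint64_t old = demo_reg;
	uint64_t val = (old & ~clear) | set;

	if (val != old) {
		demo_reg = val;
		demo_writes++;
	}
}

/* Disable access for a 32-bit task, enable it for a 64-bit task. */
static void demo_switch_to(int next_is_compat)
{
	if (next_is_compat)
		demo_clear_set(DEMO_VCT_ACCESS_EN, 0);
	else
		demo_clear_set(0, DEMO_VCT_ACCESS_EN);
}
```

Skipping the write when nothing changes keeps repeated switches between tasks of the same kind cheap in this sketch.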
|
|
|
|
|
2021-07-28 04:52:57 +08:00
|
|
|
/*
|
|
|
|
* __switch_to() checks current->thread.sctlr_user as an optimisation. Therefore
|
|
|
|
* this function must be called with preemption disabled and the update to
|
|
|
|
* sctlr_user must be made in the same preemption disabled block so that
|
|
|
|
* __switch_to() does not see the variable update before the SCTLR_EL1 one.
|
|
|
|
*/
|
|
|
|
void update_sctlr_el1(u64 sctlr)
|
2021-03-19 11:10:52 +08:00
|
|
|
{
|
2021-03-19 11:10:53 +08:00
|
|
|
/*
|
|
|
|
* EnIA must not be cleared while in the kernel as this is necessary for
|
|
|
|
* in-kernel PAC. It will be cleared on kernel exit if needed.
|
|
|
|
*/
|
|
|
|
sysreg_clear_set(sctlr_el1, SCTLR_USER_MASK & ~SCTLR_ELx_ENIA, sctlr);
|
2021-03-19 11:10:52 +08:00
|
|
|
|
|
|
|
/* ISB required for the kernel uaccess routines when setting TCF0. */
|
|
|
|
isb();
|
|
|
|
}
|
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
/*
|
|
|
|
* Thread switching.
|
|
|
|
*/
|
2021-11-29 22:28:43 +08:00
|
|
|
__notrace_funcgraph __sched
|
|
|
|
struct task_struct *__switch_to(struct task_struct *prev,
|
2012-03-05 19:49:28 +08:00
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
struct task_struct *last;
|
|
|
|
|
|
|
|
fpsimd_thread_switch(next);
|
|
|
|
tls_thread_switch(next);
|
|
|
|
hw_breakpoint_thread_switch(next);
|
2013-04-04 02:01:01 +08:00
|
|
|
contextidr_thread_switch(next);
|
2016-11-04 04:23:13 +08:00
|
|
|
entry_task_switch(next);
|
2019-07-22 21:53:09 +08:00
|
|
|
ssbs_thread_switch(next);
|
2021-12-21 07:41:14 +08:00
|
|
|
erratum_1418040_thread_switch(next);
|
2021-03-19 11:10:54 +08:00
|
|
|
ptrauth_thread_switch_user(next);
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2013-04-24 21:47:02 +08:00
|
|
|
/*
|
|
|
|
* Complete any pending TLB or cache maintenance on this CPU in case
|
|
|
|
* the thread migrates to a different CPU.
|
membarrier: Provide expedited private command
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as the
thread calling sys_membarrier. It executes faster than the non-expedited
variant (no blocking). It also works on NOHZ_FULL configurations.
Scheduler-wise, it requires a memory barrier before and after context
switching between processes (which have different mm). The memory
barrier before context switch is already present. For the barrier after
context switch:
* Our TSO archs can do RELEASE without being a full barrier. Look at
x86 spin_unlock() being a regular STORE for example. But for those
archs, all atomics imply smp_mb and all of them have atomic ops in
switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
barrier.
* From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
the rest does indeed do smp_mb(), so there the spin_unlock() is a full
barrier and we're good.
* ARM64 has a very heavy barrier in switch_to(), which suffices.
* PPC just removed its barrier from switch_to(), but appears to be
talking about adding something to switch_mm(). So add a
smp_mb__after_unlock_lock() for now, until this is settled on the PPC
side.
Changes since v3:
- Properly document the memory barriers provided by each architecture.
Changes since v2:
- Address comments from Peter Zijlstra,
- Add smp_mb__after_unlock_lock() after finish_lock_switch() in
finish_task_switch() to add the memory barrier we need after storing
to rq->curr. This is much simpler than the previous approach relying
on atomic_dec_and_test() in mmdrop(), which actually added a memory
barrier in the common case of switching between userspace processes.
- Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
kernel, rather than having the whole membarrier system call returning
-ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
Adapt the CMD_QUERY mask accordingly.
Changes since v1:
- move membarrier code under kernel/sched/ because it uses the
scheduler runqueue,
- only add the barrier when we switch from a kernel thread. The case
where we switch from a user-space thread is already handled by
the atomic_dec_and_test() in mmdrop().
- add a comment to mmdrop() documenting the requirement on the implicit
memory barrier.
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Andrew Hunter <ahh@google.com>
CC: Maged Michael <maged.michael@gmail.com>
CC: gromer@google.com
CC: Avi Kivity <avi@scylladb.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dave Watson <davejwatson@fb.com>
2017-07-29 04:40:40 +08:00
|
|
|
* This full barrier is also required by the membarrier system
|
|
|
|
* call.
|
2013-04-24 21:47:02 +08:00
|
|
|
*/
|
2014-05-02 23:24:10 +08:00
|
|
|
dsb(ish);
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2019-11-27 18:30:15 +08:00
|
|
|
/*
|
|
|
|
* MTE thread switching must happen after the DSB above to ensure that
|
|
|
|
* any asynchronous tag check faults have been logged in the TFSR*_EL1
|
|
|
|
* registers.
|
|
|
|
*/
|
|
|
|
mte_thread_switch(next);
|
2021-03-19 11:10:52 +08:00
|
|
|
/* avoid expensive SCTLR_EL1 accesses if no change */
|
|
|
|
if (prev->thread.sctlr_user != next->thread.sctlr_user)
|
|
|
|
update_sctlr_el1(next->thread.sctlr_user);
|
2019-11-27 18:30:15 +08:00
|
|
|
|
2012-03-05 19:49:28 +08:00
|
|
|
/* the actual thread switch */
|
|
|
|
last = cpu_switch_to(prev, next);
|
|
|
|
|
|
|
|
return last;
|
|
|
|
}
|
|
|
|
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will skip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will allow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 22:28:45 +08:00
|
|
|
struct wchan_info {
|
|
|
|
unsigned long pc;
|
|
|
|
int count;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool get_wchan_cb(void *arg, unsigned long pc)
|
|
|
|
{
|
|
|
|
struct wchan_info *wchan_info = arg;
|
|
|
|
|
|
|
|
if (!in_sched_functions(pc)) {
|
|
|
|
wchan_info->pc = pc;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return wchan_info->count++ < 16;
|
|
|
|
}
|
|
|
|
|
2021-09-30 06:02:14 +08:00
|
|
|
unsigned long __get_wchan(struct task_struct *p)
|
2012-03-05 19:49:28 +08:00
|
|
|
{
|
2021-11-29 22:28:45 +08:00
|
|
|
struct wchan_info wchan_info = {
|
|
|
|
.pc = 0,
|
|
|
|
.count = 0,
|
|
|
|
};
|
2012-03-05 19:49:28 +08:00
|
|
|
|
2021-11-29 22:28:45 +08:00
|
|
|
if (!try_get_task_stack(p))
|
2016-11-04 04:23:08 +08:00
|
|
|
return 0;
|
|
|
|
|
2021-11-29 22:28:45 +08:00
|
|
|
arch_stack_walk(get_wchan_cb, &wchan_info, p, NULL);
|
2019-07-02 21:07:28 +08:00
|
|
|
|
2016-11-04 04:23:08 +08:00
|
|
|
put_task_stack(p);
|
2021-11-29 22:28:45 +08:00
|
|
|
|
|
|
|
return wchan_info.pc;
|
2012-03-05 19:49:28 +08:00
|
|
|
}
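
The callback-driven walk in `__get_wchan()` can be tried outside the kernel. The sketch below assumes an array of return addresses in place of `arch_stack_walk()` and a made-up address range in place of `in_sched_functions()`. All `demo_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_wchan_info {
	unsigned long pc;
	int count;
};

/* Hypothetical stand-in: treat [0x1000, 0x2000) as scheduler code. */
static bool demo_in_sched(unsigned long pc)
{
	return pc >= 0x1000 && pc < 0x2000;
}

/* Return false to stop the walk, true to continue to the next frame. */
static bool demo_wchan_cb(void *arg, unsigned long pc)
{
	struct demo_wchan_info *info = arg;

	if (!demo_in_sched(pc)) {
		info->pc = pc;		/* first non-scheduler caller */
		return false;
	}
	return info->count++ < 16;	/* give up after too many sched frames */
}

/* Feed the recorded return addresses to the callback, innermost first. */
static unsigned long demo_get_wchan(const unsigned long *pcs, size_t n)
{
	struct demo_wchan_info info = { .pc = 0, .count = 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (!demo_wchan_cb(&info, pcs[i]))
			break;
	}
	return info.pc;
}
```

If every frame within the limit is scheduler code, the result stays 0, mirroring the best-effort nature of wchan described in the commit message.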
|
|
|
|
|
|
|
|
unsigned long arch_align_stack(unsigned long sp)
|
|
|
|
{
|
|
|
|
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
|
2022-10-05 22:43:38 +08:00
|
|
|
sp -= prandom_u32_max(PAGE_SIZE);
|
2012-03-05 19:49:28 +08:00
|
|
|
return sp & ~0xf;
|
|
|
|
}
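
The arithmetic in `arch_align_stack()` is a random downward offset of less than one page followed by rounding down to a 16-byte boundary. A minimal sketch follows, assuming a 4 KiB page and a caller-supplied random value instead of the kernel's random helper. `DEMO_PAGE_SIZE` and `demo_align_stack()` are hypothetical names.

```c
#define DEMO_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/*
 * Optionally move the stack pointer down by (rnd mod page size) bytes,
 * then clear the low four bits so the result is 16-byte aligned.
 */
static unsigned long demo_align_stack(unsigned long sp, int randomize,
				      unsigned long rnd)
{
	if (randomize)
		sp -= rnd % DEMO_PAGE_SIZE;

	return sp & ~0xfUL;
}
```

For example, `demo_align_stack(0x7ffff123UL, 0, 0)` gives `0x7ffff120`. With `randomize` set and `rnd` equal to `0x45`, the pointer first drops to `0x7ffff0de` and is then rounded down to `0x7ffff0d0`.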
|
|
|
|
|
2021-07-30 19:24:38 +08:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
int compat_elf_check_arch(const struct elf32_hdr *hdr)
|
|
|
|
{
|
|
|
|
if (!system_supports_32bit_el0())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if ((hdr)->e_machine != EM_ARM)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!((hdr)->e_flags & EF_ARM_EABI_MASK))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent execve() of a 32-bit program from a deadline task
|
|
|
|
* if the restricted affinity mask would be inadmissible on an
|
|
|
|
* asymmetric system.
|
|
|
|
*/
|
|
|
|
return !static_branch_unlikely(&arm64_mismatched_32bit_el0) ||
|
|
|
|
!dl_task_check_affinity(current, system_32bit_el0_cpumask());
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-08-20 18:20:48 +08:00
|
|
|
/*
|
|
|
|
* Called from setup_new_exec() after (COMPAT_)SET_PERSONALITY.
|
|
|
|
*/
|
|
|
|
void arch_setup_new_exec(void)
|
|
|
|
{
|
2021-06-09 02:02:57 +08:00
|
|
|
unsigned long mmflags = 0;
|
|
|
|
|
|
|
|
if (is_compat_task()) {
|
|
|
|
mmflags = MMCF_AARCH32;
|
2021-07-30 19:24:38 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Restrict the CPU affinity mask for a 32-bit task so that
|
|
|
|
* it contains only 32-bit-capable CPUs.
|
|
|
|
*
|
|
|
|
* From the perspective of the task, this looks similar to
|
|
|
|
* what would happen if the 64-bit-only CPUs were hot-unplugged
|
|
|
|
* at the point of execve(), although we try a bit harder to
|
|
|
|
* honour the cpuset hierarchy.
|
|
|
|
*/
|
2021-06-09 02:02:57 +08:00
|
|
|
if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
|
2021-07-30 19:24:38 +08:00
|
|
|
force_compatible_cpus_allowed_ptr(current);
|
|
|
|
} else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
|
|
|
|
relax_compatible_cpus_allowed_ptr(current);
|
2021-06-09 02:02:57 +08:00
|
|
|
}
|
arm64: add basic pointer authentication support
This patch adds basic support for pointer authentication, allowing
userspace to make use of APIAKey, APIBKey, APDAKey, APDBKey, and
APGAKey. The kernel maintains key values for each process (shared by all
threads within), which are initialised to random values at exec() time.
The ID_AA64ISAR1_EL1.{APA,API,GPA,GPI} fields are exposed to userspace,
to describe that pointer authentication instructions are available and
that the kernel is managing the keys. Two new hwcaps are added for the
same reason: PACA (for address authentication) and PACG (for generic
authentication).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Tested-by: Adam Wallis <awallis@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
[will: Fix sizeof() usage and unroll address key initialisation]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-12-08 02:39:25 +08:00
|
|
|
|
2021-06-09 02:02:57 +08:00
|
|
|
current->mm->context.flags = mmflags;
|
2021-03-19 11:10:53 +08:00
|
|
|
ptrauth_thread_init_user();
|
|
|
|
mte_thread_init_user();
|
2021-12-21 07:41:14 +08:00
|
|
|
erratum_1418040_new_exec();
|
2020-09-28 21:03:00 +08:00
|
|
|
|
|
|
|
if (task_spec_ssb_noexec(current)) {
|
|
|
|
arch_prctl_spec_ctrl_set(current, PR_SPEC_STORE_BYPASS,
|
|
|
|
PR_SPEC_ENABLE);
|
|
|
|
}
|
2017-08-20 18:20:48 +08:00
|
|
|
}
|
2019-07-24 01:58:39 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
|
|
|
|
/*
|
|
|
|
* Control the relaxed ABI allowing tagged user addresses into the kernel.
|
|
|
|
*/
|
2019-08-15 23:44:01 +08:00
|
|
|
static unsigned int tagged_addr_disabled;
|
2019-07-24 01:58:39 +08:00
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg)
|
2019-07-24 01:58:39 +08:00
|
|
|
{
|
2019-11-27 18:30:15 +08:00
|
|
|
unsigned long valid_mask = PR_TAGGED_ADDR_ENABLE;
|
2020-07-03 21:25:50 +08:00
|
|
|
struct thread_info *ti = task_thread_info(task);
|
2019-11-27 18:30:15 +08:00
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
if (is_compat_thread(ti))
|
2019-07-24 01:58:39 +08:00
|
|
|
return -EINVAL;
|
2019-11-27 18:30:15 +08:00
|
|
|
|
|
|
|
if (system_supports_mte())
|
2022-02-17 01:32:24 +08:00
|
|
|
valid_mask |= PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC \
|
|
|
|
| PR_MTE_TAG_MASK;
|
2019-11-27 18:30:15 +08:00
|
|
|
|
|
|
|
if (arg & ~valid_mask)
|
2019-07-24 01:58:39 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-08-15 23:44:01 +08:00
|
|
|
/*
|
|
|
|
* Do not allow the enabling of the tagged address ABI if globally
|
|
|
|
* disabled via sysctl abi.tagged_addr_disabled.
|
|
|
|
*/
|
|
|
|
if (arg & PR_TAGGED_ADDR_ENABLE && tagged_addr_disabled)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
if (set_mte_ctrl(task, arg) != 0)
|
2019-11-27 18:30:15 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
update_ti_thread_flag(ti, TIF_TAGGED_ADDR, arg & PR_TAGGED_ADDR_ENABLE);
|
2019-07-24 01:58:39 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
long get_tagged_addr_ctrl(struct task_struct *task)
|
2019-07-24 01:58:39 +08:00
|
|
|
{
|
2019-11-27 18:30:15 +08:00
|
|
|
long ret = 0;
|
2020-07-03 21:25:50 +08:00
|
|
|
struct thread_info *ti = task_thread_info(task);
|
2019-11-27 18:30:15 +08:00
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
if (is_compat_thread(ti))
|
2019-07-24 01:58:39 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
if (test_ti_thread_flag(ti, TIF_TAGGED_ADDR))
|
2019-11-27 18:30:15 +08:00
|
|
|
ret = PR_TAGGED_ADDR_ENABLE;
|
2019-07-24 01:58:39 +08:00
|
|
|
|
2020-07-03 21:25:50 +08:00
|
|
|
ret |= get_mte_ctrl(task);
|
2019-11-27 18:30:15 +08:00
|
|
|
|
|
|
|
return ret;
|
2019-07-24 01:58:39 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global sysctl to disable the tagged user addresses support. This control
|
|
|
|
* only prevents the tagged address ABI enabling via prctl() and does not
|
|
|
|
* disable it for tasks that already opted in to the relaxed ABI.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static struct ctl_table tagged_addr_sysctl_table[] = {
|
|
|
|
{
|
2019-08-15 23:44:01 +08:00
|
|
|
.procname = "tagged_addr_disabled",
|
2019-07-24 01:58:39 +08:00
|
|
|
.mode = 0644,
|
2019-08-15 23:44:01 +08:00
|
|
|
.data = &tagged_addr_disabled,
|
2019-07-24 01:58:39 +08:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2020-01-24 23:51:27 +08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2019-07-24 01:58:39 +08:00
|
|
|
},
|
|
|
|
{ }
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init tagged_addr_init(void)
|
|
|
|
{
|
|
|
|
if (!register_sysctl("abi", tagged_addr_sysctl_table))
|
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
core_initcall(tagged_addr_init);
|
|
|
|
#endif /* CONFIG_ARM64_TAGGED_ADDR_ABI */
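
The argument validation at the top of `set_tagged_addr_ctrl()` can be sketched on its own: build a mask of accepted bits, reject anything outside it, then reject an enable request when the ABI is globally disabled. The `DEMO_*` constants below are local stand-ins defined for this sketch and should not be read as the authoritative prctl values. `demo_check_ctrl()` and its `has_mte` and `abi_disabled` parameters are hypothetical.

```c
#include <errno.h>

/* Local stand-ins for the prctl flag bits; assumptions for illustration. */
#define DEMO_TAGGED_ADDR_ENABLE	(1UL << 0)
#define DEMO_MTE_TCF_SYNC	(1UL << 1)
#define DEMO_MTE_TCF_ASYNC	(1UL << 2)
#define DEMO_MTE_TAG_MASK	(0xffffUL << 3)

static long demo_check_ctrl(unsigned long arg, int has_mte, int abi_disabled)
{
	unsigned long valid_mask = DEMO_TAGGED_ADDR_ENABLE;

	/* MTE-capable systems accept the extra tag-check and tag-mask bits. */
	if (has_mte)
		valid_mask |= DEMO_MTE_TCF_SYNC | DEMO_MTE_TCF_ASYNC |
			      DEMO_MTE_TAG_MASK;

	/* Any bit outside the accepted set is an error. */
	if (arg & ~valid_mask)
		return -EINVAL;

	/* Enabling is refused while the global switch is set. */
	if ((arg & DEMO_TAGGED_ADDR_ENABLE) && abi_disabled)
		return -EINVAL;

	return 0;
}
```

Note that a request which does not set the enable bit still passes when the global switch is on, matching the sysctl comment above: the switch only blocks new opt-ins.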
|
arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
Preempting from IRQ-return means that the task has its PSTATE saved
on the stack, which will get restored when the task is resumed and does
the actual IRQ return.
However, enabling some CPU features requires modifying the PSTATE. This
means that, if a task was scheduled out during an IRQ-return before all
CPU features are enabled, the task might restore a PSTATE that does not
include the feature enablement changes once scheduled back in.
* Task 1:
PAN == 0 ---| |---------------
| |<- return from IRQ, PSTATE.PAN = 0
| <- IRQ |
+--------+ <- preempt() +--
^
|
reschedule Task 1, PSTATE.PAN == 1
* Init:
--------------------+------------------------
^
|
enable_cpu_features
set PSTATE.PAN on all CPUs
Worse than this, since PSTATE is untouched when task switching is done,
a task missing the new bits in PSTATE might affect another task, if both
do direct calls to schedule() (outside of IRQ/exception contexts).
Fix this by preventing preemption on IRQ-return until features are
enabled on all CPUs.
This way the only PSTATE values that are saved on the stack are from
synchronous exceptions. These are expected to be fatal this early; the
exception is BRK for WARN_ON(), but as this uses do_debug_exception()
which keeps IRQs masked, it shouldn't call schedule().
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[james: Replaced a really cool hack, with an even simpler static key in C.
expanded commit message with Julien's cover-letter ascii art]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-10-16 01:25:44 +08:00
|
|
|
|
2020-03-17 00:50:47 +08:00
|
|
|
#ifdef CONFIG_BINFMT_ELF
|
|
|
|
int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
|
|
|
|
bool has_interp, bool is_interp)
|
|
|
|
{
|
2020-03-24 01:01:19 +08:00
|
|
|
/*
|
|
|
|
* For dynamically linked executables the interpreter is
|
|
|
|
* responsible for setting PROT_BTI on everything except
|
|
|
|
* itself.
|
|
|
|
*/
|
2020-03-17 00:50:47 +08:00
|
|
|
if (is_interp != has_interp)
|
|
|
|
return prot;
|
|
|
|
|
|
|
|
if (!(state->flags & ARM64_ELF_BTI))
|
|
|
|
return prot;
|
|
|
|
|
|
|
|
if (prot & PROT_EXEC)
|
|
|
|
prot |= PROT_BTI;
|
|
|
|
|
|
|
|
return prot;
|
|
|
|
}
|
|
|
|
#endif
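
The decision made by `arch_elf_adjust_prot()` reduces to three checks, sketched below as a stand-alone function. The `DEMO_*` constants are local stand-ins chosen for this sketch, and `demo_adjust_prot()` takes the flags word directly instead of a `struct arch_elf_state`. Both are assumptions made for illustration.

```c
#include <stdbool.h>

/* Local stand-ins for the protection and ELF flag bits (assumptions). */
#define DEMO_PROT_READ	0x1
#define DEMO_PROT_EXEC	0x4
#define DEMO_PROT_BTI	0x10
#define DEMO_ELF_BTI	0x1

static int demo_adjust_prot(int prot, unsigned int flags,
			    bool has_interp, bool is_interp)
{
	/*
	 * For a dynamically linked executable, the kernel leaves the main
	 * binary alone and only handles the interpreter itself.
	 */
	if (is_interp != has_interp)
		return prot;

	/* The object did not opt in to BTI. */
	if (!(flags & DEMO_ELF_BTI))
		return prot;

	/* Only executable mappings gain the BTI protection bit. */
	if (prot & DEMO_PROT_EXEC)
		prot |= DEMO_PROT_BTI;

	return prot;
}
```

With these stand-in values, a statically linked BTI executable mapped read and execute goes from `0x5` to `0x15`, while the main binary of a dynamically linked program keeps `0x5`.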
|