Commit f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB") increased
the task size on book3s, and introduced a mechanism to dynamically
control whether a task uses these larger addresses. While the change to
the task size itself was ifdef-protected to only apply on book3s, the
change to STACK_TOP_USER64 was not. On book3e, this had the effect of
trying to use addresses up to 128TiB for the stack despite a 64TiB task
size limit -- which broke 64-bit userspace producing the following errors:
Starting init: /sbin/init exists but couldn't execute it (error -14)
Starting init: /bin/sh exists but couldn't execute it (error -14)
Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Scott Wood <oss@buserror.net>
This patch allows the use of IRQ to notify the change of GPIO status
on MPC8xx CPM IO ports. This then allows to associate IRQs to GPIOs
in the Device Tree.
Ex:
CPM1_PIO_C: gpio-controller@960 {
#gpio-cells = <2>;
compatible = "fsl,cpm1-pario-bank-c";
reg = <0x960 0x10>;
fsl,cpm1-gpio-irq-mask = <0x0fff>;
interrupts = <1 2 6 9 10 11 14 15 23 24 26 31>;
interrupt-parent = <&CPM_PIC>;
gpio-controller;
};
The property 'fsl,cpm1-gpio-irq-mask' defines which of the 16 GPIOs
have the associated interrupts defined in the 'interrupts' property.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Work for Congestion State Notifications (CSCN) and Message Ring (MR)
handling is handled via the workqueue mechanism. This requires the
driver to disable those IRQs before scheduling the work and re-enabling
it once the work is completed so that the interrupt doesn't continually
fire.
Signed-off-by: Roy Pledge <roy.pledge@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
This allows to build the fsl_ucc_hdlc driver as a module.
Signed-off-by: Valentin Longchamp <valentin.longchamp@keymile.com>
Signed-off-by: Scott Wood <oss@buserror.net>
The QE_General4 workaround is only valid for the MPC832x and MPC836x
SoCs. The other SoCs that embed a QUICC engine are not affected by this
hardware bug and thus can use the computed divisors (this was
successfully tested on the T1040).
Similalry to what was done in commit 8ce795cb0c ("i2c: mpc: assign the
correct prescaler from SVR") in order to avoid changes in
the device tree nodes of the QE (with maybe a variant of the compatible
property), the PVR reg is read out to find out if the workaround must be
applied or not.
Signed-off-by: Valentin Longchamp <valentin.longchamp@keymile.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Because of integer computation rounding in u-boot (that sets the QE
brg-frequency DTS prop), the clk value is 99999999 Hz even though it is
100 MHz.
When setting brg clks that are exact divisors of 100 MHz, this small
differnce plays a role and can result in lower clks to be output (for
instance 20 MHz - divide by 5 - results in 16.666 MHz - divide by 6).
This patch fixes that by "forcing" the brg_clk to the nearest kHz when
the difference is below 2 integer rounding errors (i.e. 4).
Signed-off-by: Valentin Longchamp <valentin.longchamp@keymile.com>
Signed-off-by: Scott Wood <oss@buserror.net>
immrbar_virt_to_phys() is not used anymore
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Li Yang <pku.leo@gmail.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Since commit 5093bb965a ("powerpc/QE: switch to the cpm_muram
implementation"), muram area is not part of immrbar mapping anymore
so immrbar_virt_to_phys() is not usable anymore.
Fixes: 5093bb965a ("powerpc/QE: switch to the cpm_muram implementation")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Li Yang <pku.leo@gmail.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Debug interrupts can be taken during interrupt entry, since interrupt
entry does not automatically turn them off. The kernel will check
whether the faulting instruction is between [interrupt_base_book3e,
__end_interrupts], and if so clear MSR[DE] and return.
However, when the kernel is built with CONFIG_RELOCATABLE, it can't use
LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e) and
LOAD_REG_IMMEDIATE(r15,__end_interrupts), as they ignore relocation.
Thus, if the kernel is actually running at a different address than it
was built at, the address comparison will fail, and the exception entry
code will hang at kernel_dbg_exc.
r2(toc) is also not usable here, as r2 still holds data from the
interrupted context, so LOAD_REG_ADDR() doesn't work either. So we use
the *name@got* to get the EV of two labels directly.
Test programs test.c shows as follows:
int main(int argc, char *argv[])
{
if (access("/proc/sys/kernel/perf_event_paranoid", F_OK) == -1)
printf("Kernel doesn't have perf_event support\n");
}
Steps to reproduce the bug, for example:
1) ./gdb ./test
2) (gdb) b access
3) (gdb) r
4) (gdb) s
Signed-off-by: Liu Hailong <liu.hailong6@zte.com.cn>
Signed-off-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Reviewed-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Huang Jian <huang.jian@zte.com.cn>
[scottwood: cleaned up commit message, and specified bad behavior
as a hang rather than an oops to correspond to mainline kernel behavior]
Fixes: 1cb6e06492 ("powerpc/book3e: support CONFIG_RELOCATABLE")
Cc: <stable@vger.kernel.org> # 4.4.x-
Signed-off-by: Scott Wood <oss@buserror.net>
Split ftrace_64.S further retaining the core ftrace 64-bit aspects
in ftrace_64.S and moving ftrace_caller() and ftrace_graph_caller() into
separate files based on -mprofile-kernel. The livepatch routines are all
now contained within the mprofile file.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
entry_*.S now includes a lot more than just kernel entry/exit code. As a
first step at cleaning this up, let's split out the ftrace bits into
separate files. Also move all related tracing code into a new trace/
subdirectory.
No functional changes.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Page table dump debugfs file is named 'kernel_page_tables' on
all other architectures implementing it, while is is named
'kernel_pagetables' on powerpc. This patch renames it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used.
There is also _PAGE_SHARED that is missing.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PPC32 (eg. mpc885_ads_defconfig), page table dump compilation fails as
follows. This is because the memory layout is slightly different on PPC32. This
patch adapts it.
arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables':
arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' undeclared (first use in this function)
...
Fixes: 8eb07b1870 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_tlbiel_pid() is called with a ric (Radix Invalidation Control) argument of
either RIC_FLUSH_TLB or RIC_FLUSH_ALL.
RIC_FLUSH_ALL says to invalidate the entire TLB and the Page Walk Cache (PWC).
To flush the whole TLB, we have to iterate over each set (congruence class) of
the TLB. Currently we do that and pass RIC_FLUSH_ALL each time. That is not
incorrect but it means we flush the PWC 128 times, when once would suffice.
Fix it by doing the first flush with the ric value we're passed, and then if it
was RIC_FLUSH_ALL, we downgrade it to RIC_FLUSH_TLB, because we know we have
just flushed the PWC and don't need to do it again.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, tweak logic, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we implement flushing of the page walk cache (PWC) by calling
_tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to
only flush the PWC.
But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is
not necessary when we're just flushing the PWC.
In fact the set argument is ignored for a PWC flush, so essentially we're just
flushing the PWC 127 extra times for no benefit.
Fix it by adding tlbiel_pwc() which just does a single flush of the PWC.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Split out of combined patch, drop _ in name, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Recently we merged the native xive support for Power9, and then separately some
reworks for doorbell IPI support. In isolation both series were OK, but the
merged result had a bug in one case.
On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
falls back to the interrupt controller. However the fallback is implemented by
calling icp_ops->cause_ipi. But now that xive support is merged we might be
using xive, in which case icp_ops is not initialised, it's a xics specific
structure. This leads to an oops such as:
Unable to handle kernel paging request for data at address 0x00000028
Oops: Kernel access of bad area, sig: 11 [#1]
NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
LR smp_muxed_ipi_message_pass+0x54/0x70
To fix it, rather than using icp_ops which might be NULL, have both xics and
xive set smp_ops->cause_ipi, and then in the powernv code we save that as
ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON()
to check if somehow smp_ops->cause_ipi is NULL.
Fixes: b866cc2199 ("powerpc: Change the doorbell IPI calling convention")
Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In opal_export_attrs() we dynamically allocate some bin_attributes. They're
allocated with kmalloc() and although we initialise most of the fields, we don't
initialise write() or mmap(), and in particular we don't initialise the lockdep
related fields in the embedded struct attribute.
This leads to a lockdep warning at boot:
BUG: key c0000000f11906d8 not in .data!
WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0
...
Call Trace:
lockdep_init_map+0x288/0x2a0 (unreliable)
__kernfs_create_file+0x8c/0x170
sysfs_add_file_mode_ns+0xc8/0x240
__machine_initcall_powernv_opal_init+0x60c/0x684
do_one_initcall+0x60/0x1c0
kernel_init_freeable+0x2f4/0x3d4
kernel_init+0x24/0x160
ret_from_kernel_thread+0x5c/0xb0
Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
fields.
Fixes: 11fe909d23 ("powerpc/powernv: Add OPAL exports attributes to sysfs")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The recent patch to add runtime configuration of the ASLR limits added a bug in
arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits,
leading to undefined behaviour.
In practice it exhibits as every process seg faulting instantly, presumably
because the rnd value hasn't been restricited by the modulus at all. We didn't
notice because it only happens under certain kernel configurations and if the
number of bits is actually set to a large value.
Fix it by switching to unsigned long.
Fixes: 9fea59bd7c ("powerpc/mm: Add support for runtime configuration of ASLR limits")
Reported-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc expects IRQs to already be (soft) disabled when switch_mm() is
called, as made clear in the commit message of 9c1e105238 ("powerpc: Allow
perf_counters to access user memory at interrupt time").
Aside from any race conditions that might exist between switch_mm() and an IRQ,
there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
followed at some point by an IRQ enable then interrupts will remain disabled
until we return to userspace.
It is true that when switch_mm() is called from the scheduler IRQs are off, but
not when it's called by use_mm(). Looking closer we see that last year in commit
f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
this was made more explicit by the addition of switch_mm_irqs_off() which is now
called by the scheduler, vs switch_mm() which is used by use_mm().
Arguably it is a bug in use_mm() to call switch_mm() in a different context than
it expects, but fixing that will take time.
This was discovered recently when vhost started throwing warnings such as:
BUG: sleeping function called from invalid context at kernel/mutex.c:578
in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
no locks held by vhost-10760/10768.
irq event stamp: 10
hardirqs last enabled at (9): _raw_spin_unlock_irq+0x40/0x80
hardirqs last disabled at (10): switch_slb+0x2e4/0x490
softirqs last enabled at (0): copy_process+0x5e8/0x1260
softirqs last disabled at (0): (null)
Call Trace:
show_stack+0x88/0x390 (unreliable)
dump_stack+0x30/0x44
__might_sleep+0x1c4/0x2d0
mutex_lock_nested+0x74/0x5c0
cgroup_attach_task_all+0x5c/0x180
vhost_attach_cgroups_work+0x58/0x80 [vhost]
vhost_worker+0x24c/0x3d0 [vhost]
kthread+0xec/0x100
ret_from_kernel_thread+0x5c/0xd4
Prior to commit 04b96e5528 ("vhost: lockless enqueuing") (Aug 2016) the
vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
which had the effect of reenabling IRQs. Since that commit removed the locking
in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
off causing the warnings.
This patch addresses the problem by making the powerpc code mirror the x86 code,
ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
defining switch_mm_irqs_off().
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Flesh out/rewrite change log, add stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For CPUs present at boot each logical CPU acquires a reference to the
associated device node of the core. This happens in register_cpu() which
is called by topology_init(). The result of this is that we end up with
a reference held by each thread of the core. However, these references
are never freed if the CPU core is DLPAR removed.
This patch fixes the reference leaks by acquiring and releasing the references
in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric
reference counting is observed with both CPUs present at boot, and those DLPAR
added after boot.
Fixes: f86e4718f2 ("driver/core: cpu: initialize of_node in cpu's device struture")
Cc: stable@vger.kernel.org # v3.12+
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Historically struct device_node references were tracked using a kref embedded as
a struct field. Commit 75b57ecf9d ("of: Make device nodes kobjects so they
show up in sysfs") (Mar 2014) refactored device_nodes to be kobjects such that
the device tree could by more simply exposed to userspace using sysfs.
Commit 0829f6d1f6 ("of: device_node kobject lifecycle fixes") (Mar 2014)
followed up these changes to better control the kobject lifecycle and in
particular the referecne counting via of_node_get(), of_node_put(), and
of_node_init().
A result of this second commit was that it introduced an of_node_put() call when
a dynamic node is detached, in of_node_remove(), that removes the initial kobj
reference created by of_node_init().
Traditionally as the original dynamic device node user the pseries code had
assumed responsibilty for releasing this final reference in its platform
specific DLPAR detach code.
This patch fixes a refcount underflow introduced by commit 0829f6d1f6, and
recently exposed by the upstreaming of the recount API.
Messages like the following are no longer seen in the kernel log with this
patch following DLPAR remove operations of cpus and pci devices.
rpadlpar_io: slot PHB 72 removed
refcount_t: underflow; use-after-free.
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110
Fixes: 0829f6d1f6 ("of: device_node kobject lifecycle fixes")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
[mpe: Make change log commit references more verbose]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the code that dumps SLB entries uses a double-nested if. This
means the actual dumping logic is a bit squashed. Deindent it by using
continue.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Although most of these kprobes patches are powerpc specific, there's a couple
that touch generic code (with Acks). At the moment there's one conflict with
acme's tree, but it's not too bad. Still just in case some other conflicts show
up, we've put these in a topic branch so another tree could merge some or all of
it if necessary.
KPROBES_ON_FTRACE avoids much of the overhead of regular kprobes as it
eliminates the need for a trap, as well as the need to emulate or single-step
instructions.
Though OPTPROBES provides us with similar performance, we have limited
optprobes trampoline slots. As such, when asked to probe at a function
entry, default to using the ftrace infrastructure.
With:
# cd /sys/kernel/debug/tracing
# echo 'p _do_fork' > kprobe_events
before patch:
# cat ../kprobes/list
c0000000000daf08 k _do_fork+0x8 [DISABLED]
c000000000044fc0 k kretprobe_trampoline+0x0 [OPTIMIZED]
and after patch:
# cat ../kprobes/list
c0000000000d074c k _do_fork+0xc [DISABLED][FTRACE]
c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED]
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kprobe_lookup_name() is specific to the kprobe subsystem and may not always
return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE).
For looking up function entry points, introduce a separate helper and use it
in optprobes.c
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Allow kprobes to be placed on ftrace _mcount() call sites. This optimization
avoids the use of a trap, by riding on ftrace infrastructure.
This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL,
which is only currently enabled on powerpc64le with newer toolchains.
Based on the x86 code by Masami.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pass the real LR to the ftrace handler. This is needed for KPROBES_ON_FTRACE for
the pre handlers.
Also, with KPROBES_ON_FTRACE, the link register may be updated by the pre
handlers or by a registed kretprobe. Honor updated LR by restoring it from
pt_regs, rather than from the stack save area.
Live patch and function graph continue to work fine with this change.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Blacklist all the exception common/OOL handlers as the kernel stack is not yet
setup, which means we can't take a trap at this point.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduce __head_end to mark end of the early fixed sections and use it to
blacklist all exception handlers from kprobes.
mpe: We do not need to do anything special for relocatable kernels, where the
exception vectors are split from the main kernel, as the split vectors are
already excluded by the check for kernel_text_address().
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Move __head_end outside #ifdef 64-bit to unbreak the 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Along similar lines as commit 9326638cbe ("kprobes, x86: Use NOKPROBE_SYMBOL()
instead of __kprobes annotation"), convert __kprobes annotation to either
NOKPROBE_SYMBOL() or nokprobe_inline. The latter forces inlining, in which case
the caller needs to be added to NOKPROBE_SYMBOL().
Also:
- blacklist arch_deref_entry_point(), and
- convert a few regular inlines to nokprobe_inline in lib/sstep.c
A key benefit is the ability to detect such symbols as being
blacklisted. Before this patch:
$ cat /sys/kernel/debug/kprobes/blacklist | grep read_mem
$ perf probe read_mem
Failed to write event: Invalid argument
Error: Failed to add events.
$ dmesg | tail -1
[ 3736.112815] Could not insert probe at _text+10014968: -22
After patch:
$ cat /sys/kernel/debug/kprobes/blacklist | grep read_mem
0xc000000000072b50-0xc000000000072d20 read_mem
$ perf probe read_mem
read_mem is blacklisted function, skip it.
Added new events:
(null):(null) (on read_mem)
probe:read_mem (on read_mem)
You can now use it in all perf tools, such as:
perf record -e probe:read_mem -aR sleep 1
$ grep " read_mem" /proc/kallsyms
c000000000072b50 t read_mem
c0000000005f3b40 t read_mem
$ cat /sys/kernel/debug/kprobes/list
c0000000005f3b48 k read_mem+0x8 [DISABLED]
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Minor change log formatting, fix up some conflicts]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the stack setup and teardown code into ftrace_graph_caller(). This way, we
don't incur the cost of setting it up unless function graph is enabled for this
function.
Also, remove the extraneous LR restore code after the function graph stub. LR
has previously been restored and neither livepatch_handler() nor
ftrace_graph_caller() return back here.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Drop bad change to non-mprofile-kernel version of ftrace_graph_caller]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The idle workaround does not need to load PACATOC, and it does not
need to be called within a nested function that requires LR to be
saved.
Load the PACATOC at entry to the idle wakeup. It does not matter which
PACA this comes from, so it's okay to call before the workaround. Then
apply the workaround to get the right PACA.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If not all threads were in winkle, full state loss recovery is not
necessary and can be avoided. A previous patch removed this optimisation
due to some complexity with the implementation. Re-implement it by
counting the number of threads in winkle with the per-core idle state.
Only restore full state loss if all threads were in winkle.
This has a small window of false positives right before threads execute
winkle and just after they wake up, when the winkle count does not
reflect the true number of threads in winkle. This is not a significant
problem in comparison with even the minimum winkle duration. For
correctness, a false positive is not a problem (only false negatives
would be).
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When taking the core idle state lock, grab it immediately like a regular
lock, rather than adding more tests in there. Holding the lock keeps it
stable, so there is no need to do it whole holding the reservation.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In preparation for adding more bits to the core idle state word, move
the lock bit up, and unlock by flipping the lock bit rather than masking
off all but the thread bits.
Add branch hints for atomic operations while we're here.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ISA specifies power save wakeup due to a machine check exception can
cause a machine check interrupt (rather than the usual system reset
interrupt).
The machine check handler copes with this by doing low level machine
check recovery without restoring full state from idle, then queues up a
machine check event for logging, then directly executes the same idle
instruction it woke from. This minimises the work done before recovery
is performed.
The problem is that it requires machine specific instructions and
knowledge of the book3s idle code. Currently it only has code to handle
POWER8 idle, so POWER9 crashes when trying to execute the P8 idle
instructions which don't exist in ISAv3.0B.
cpu 0x0: Vector: e40 (Emulation Assist) at [c0000000008f3810]
pc: c000000000008380: machine_check_handle_early+0x130/0x2f0
lr: c00000000053a098: stop_loop+0x68/0xd0
sp: c0000000008f3a90
msr: 9000000000081001
current = 0xc0000000008a1080
paca = 0xc00000000ffd0000 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/0
Instead of going to sleep after recovery, do the usual idle wakeup and
state restoration by calling into the normal idle wakeup path. This
reuses the normal idle wakeup paths.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reduces the number of nops for POWER8.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The POWER8 idle code has a neat trick of programming the power on engine
to restore a low bit into HSPRG0, so idle wakeup code can test and see
if it has been programmed this way and therefore lost all state. Restore
time can be reduced if winkle has not been reached.
However this messes with our r13 PACA pointer, and requires HSPRG0 to be
written to. It also optimizes the slowest and most uncommon case at the
expense of another SPR write in the common nap state wakeup.
Remove this complexity and assume winkle sleeps always require a state
restore. This speedup could be made entirely contained within the winkle
idle code by counting per-core winkles and setting a thread bitmap when
all have gone to winkle.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
No functional change.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system reset idle handler system_reset_idle_common is relocated, so
relocation is not required to branch to kvm_start_guest. The superfluous
relocation does not result in incorrect code, but it does not compile
outside of exception-64s.S (with fixed section definitions).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add powerpc support for mmap_rnd_bits and mmap_rnd_compat_bits, which are two
sysctls that allow a user to configure the number of bits of randomness used for
ASLR.
Because of the way the Kconfig for ARCH_MMAP_RND_BITS is defined, we have to
construct at least the MIN value in Kconfig, vs in a header which would be more
natural. Given that we just go ahead and do it all in Kconfig.
At least according to the code (the documentation makes no mention of it), the
value is defined as the number of bits of randomisation *of the page*, not the
address. This makes some sense, with larger page sizes more of the low bits are
forced to zero, which would reduce the randomisation if we didn't take the
PAGE_SIZE into account. However it does mean the min/max values have to change
depending on the PAGE_SIZE in order to actually limit the amount of address
space consumed by the randomisation.
The result of that is that we have to define the default values based on both
32-bit vs 64-bit, but also the configured PAGE_SIZE. Furthermore now that we
have 128TB address space support on Book3S, we also have to take that into
account.
Finally we can wire up the value in arch_mmap_rnd().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
The default implementation of ioremap_cache() is aliased to ioremap().
On powerpc ioremap() creates cache-inhibited mappings by default which
is almost certainly not what you wanted.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On kprobe handler re-entry, try to emulate the instruction rather than single
stepping always.
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Factor out code to emulate instruction into a try_to_emulate()
helper function. This makes no functional changes.
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With ABIv2, we offset 8 bytes into a function to get at the local entry
point.
mpe: NB this function is currently not called, the change to generic code to
call it is being merged via the tip tree.
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit 239aeba764 ("perf powerpc: Fix kprobe and kretprobe handling with
kallsyms on ppc64le") changed how we use the offset field in struct kprobe on
ABIv2. perf now offsets from the global entry point if an offset is specified
and otherwise chooses the local entry point.
Fix the same in kernel for kprobe API users. We do this by extending
kprobe_lookup_name() to accept an additional parameter to indicate the offset
specified with the kprobe registration. If offset is 0, we return the local
function entry and return the global entry point otherwise.
With:
# cd /sys/kernel/debug/tracing/
# echo "p _do_fork" >> kprobe_events
# echo "p _do_fork+0x10" >> kprobe_events
before this patch:
# cat ../kprobes/list
c0000000000d0748 k _do_fork+0x8 [DISABLED]
c0000000000d0758 k _do_fork+0x18 [DISABLED]
c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED]
and after:
# cat ../kprobes/list
c0000000000d04c8 k _do_fork+0x8 [DISABLED]
c0000000000d04d0 k _do_fork+0x10 [DISABLED]
c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED]
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The macro is now pretty long and ugly on powerpc. In the light of further
changes needed here, convert it to a __weak variant to be over-ridden with a
nicer looking function.
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>