enable_kernel_vsx() function was commented since anything was using
it. However, vmx-crypto driver uses VSX instructions which are
only available if VSX is enable. Otherwise it rises an exception oops.
This patch uncomment enable_kernel_vsx() routine and makes it available.
Signed-off-by: Leonidas S. Barbosa <leosilva@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The process context switch code no longer uses dscr_default variable
from the sysfs.c file. The variable became unused when we started
storing the CPU specific DSCR value in the PACA structure instead.
This patch just removes this extern declaration. It was originally
added by the following commit.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The 'arg' argument to copy_thread() is only ever used when forking a new
kernel thread. Hence, rename it to 'kthread_arg' for clarity.
Signed-off-by: Alex Dowad <alexinbeijing@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 2009 we merged 501cb16d3c "Randomise PIEs", which added support for
randomizing PIE (Position Independent Executable) binaries.
That commit added randomize_et_dyn(), which correctly randomized the addresses,
but failed to honor PF_RANDOMIZE. That means it was not possible to disable PIE
randomization via the personality flag, or /proc/sys/kernel/randomize_va_space.
Since then there has been generic support for PIE randomization added to
binfmt_elf.c, selectable via ARCH_BINFMT_ELF_RANDOMIZE_PIE.
Enabling that allows us to drop randomize_et_dyn(), which means we start
honoring PF_RANDOMIZE correctly.
It also causes a fairly major change to how we layout PIE binaries.
Currently we will place the binary at 512MB-520MB for 32 bit binaries, or
512MB-1.5GB for 64 bit binaries, eg:
$ cat /proc/$$/maps
4e550000-4e580000 r-xp 00000000 08:02 129813 /bin/dash
4e580000-4e590000 rw-p 00020000 08:02 129813 /bin/dash
10014110000-10014140000 rw-p 00000000 00:00 0 [heap]
3fffaa3f0000-3fffaa5a0000 r-xp 00000000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fffaa5a0000-3fffaa5b0000 rw-p 001a0000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fffaa5c0000-3fffaa5d0000 rw-p 00000000 00:00 0
3fffaa5d0000-3fffaa5f0000 r-xp 00000000 00:00 0 [vdso]
3fffaa5f0000-3fffaa620000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fffaa620000-3fffaa630000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3ffffc340000-3ffffc370000 rw-p 00000000 00:00 0 [stack]
With this commit applied we don't do any special randomisation for the binary,
and instead rely on mmap randomisation. This means the binary ends up at high
addresses, eg:
$ cat /proc/$$/maps
3fff99820000-3fff999d0000 r-xp 00000000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fff999d0000-3fff999e0000 rw-p 001a0000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fff999f0000-3fff99a00000 rw-p 00000000 00:00 0
3fff99a00000-3fff99a20000 r-xp 00000000 00:00 0 [vdso]
3fff99a20000-3fff99a50000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fff99a50000-3fff99a60000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fff99a60000-3fff99a90000 r-xp 00000000 08:02 129813 /bin/dash
3fff99a90000-3fff99aa0000 rw-p 00020000 08:02 129813 /bin/dash
3fffc3de0000-3fffc3e10000 rw-p 00000000 00:00 0 [stack]
3fffc55e0000-3fffc5610000 rw-p 00000000 00:00 0 [heap]
Although this should be OK, it's possible it might break badly written
binaries that make assumptions about the address space layout.
Signed-off-by: Vineeth Vijayan <vvijayan@mvista.com>
[mpe: Rewrite changelog]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mod_return_to_handler is the same as return_to_handler, except
it handles the change of the TOC (r2). Add this into
return_to_handler and remove mod_return_to_handler.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We really don't want to take a pagefault in show_instructions,
so use probe_kernel_address instead of __get_user.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.
V2->V2
- Fix up to work against 3.18-rc1
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael points out that __get_SP() is a pretty horrible
function name. Let's give it a better name.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Li Zhong points out an issue with our current __get_SP()
implementation. If ftrace function tracing is enabled (ie -pg
profiling using _mcount) we spill a stack frame on 64bit all the
time.
If a function calls __get_SP() and later calls a function that is
tail call optimised, we will pop the stack frame and the value
returned by __get_SP() is no longer valid. An example from Li can
be found in save_stack_trace -> save_context_stack:
c0000000000432c0 <.save_stack_trace>:
c0000000000432c0: mflr r0
c0000000000432c4: std r0,16(r1)
c0000000000432c8: stdu r1,-128(r1) <-- stack frame for _mcount
c0000000000432cc: std r3,112(r1)
c0000000000432d0: bl <._mcount>
c0000000000432d4: nop
c0000000000432d8: mr r4,r1 <-- __get_SP()
c0000000000432dc: ld r5,632(r13)
c0000000000432e0: ld r3,112(r1)
c0000000000432e4: li r6,1
c0000000000432e8: addi r1,r1,128 <-- pop stack frame
c0000000000432ec: ld r0,16(r1)
c0000000000432f0: mtlr r0
c0000000000432f4: b <.save_context_stack> <-- tail call optimized
save_context_stack ends up with a stack pointer below the current
one, and it is likely to be scribbled over.
Fix this by making __get_SP() a function which returns the
callers stack frame. Also replace inline assembly which grabs
the stack pointer in save_stack_trace and show_stack with
__get_SP().
This also fixes an issue with perf_arch_fetch_caller_regs().
It currently unwinds the stack once, which will skip a
valid stack frame on a leaf function. With the __get_SP() fixes
in this patch, we never need to unwind the stack frame to get
to the first interesting frame.
We have to export __get_SP() because perf_arch_fetch_caller_regs()
(which is used in modules) calls it from a header file.
Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some people see things like "Exception: 501" in stack traces in dmesg
and assume that means that something has gone badly wrong, when in
fact "Exception: 501" just means a device interrupt was taken.
This changes "Exception" to "interrupt" to make it clearer that we
are just recording the fact of a change in control flow rather than
some error condition.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The previous patch left a bit of a wart in copy_process(). Clean it up a
bit by moving the logic out into a helper.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We now only support cpus that use an SLB, so we don't need an MMU
feature to indicate that.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Correct the DSCR SPR becoming temporarily corrupted if a task is
context switched during a transaction.
The problem occurs while suspending the task and is caused by saving
the DSCR to thread.dscr after it has already been set to the CPU's
default value:
__switch_to() calls __switch_to_tm()
which calls tm_reclaim_task()
which calls tm_reclaim_thread()
which calls tm_reclaim()
where the DSCR is set to the CPU's default
__switch_to() calls _switch()
where thread.dscr is set to the DSCR
When the task is resumed, it's transaction will be doomed (as usual)
and the DSCR SPR will be corrupted, although the checkpointed value
will be correct. Therefore the DSCR will be immediately corrected by
the transaction aborting, unless it has been suspended. In that case
the incorrect value can be seen by the task until it resumes the
transaction.
The fix is to treat the DSCR similarly to the TAR and save it early
in __switch_to().
A program exposing the problem is added to the kernel self tests as:
tools/testing/selftests/powerpc/tm/tm-resched-dscr.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
CC: <stable@vger.kernel.org> [v3.10+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, on 8641D, which doesn't set CONFIG_HAVE_HW_BREAKPOINT
we get the following splat:
BUG: using smp_processor_id() in preemptible [00000000] code: login/1382
caller is set_breakpoint+0x1c/0xa0
CPU: 0 PID: 1382 Comm: login Not tainted 3.15.0-rc3-00041-g2aafe1a4d451 #1
Call Trace:
[decd5d80] [c0008dc4] show_stack+0x50/0x158 (unreliable)
[decd5dc0] [c03c6fa0] dump_stack+0x7c/0xdc
[decd5de0] [c01f8818] check_preemption_disabled+0xf4/0x104
[decd5e00] [c00086b8] set_breakpoint+0x1c/0xa0
[decd5e10] [c00d4530] flush_old_exec+0x2bc/0x588
[decd5e40] [c011c468] load_elf_binary+0x2ac/0x1164
[decd5ec0] [c00d35f8] search_binary_handler+0xc4/0x1f8
[decd5ef0] [c00d4ee8] do_execve+0x3d8/0x4b8
[decd5f40] [c001185c] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfeee554
LR = 0xfeee7d4
The call path in this case is:
flush_thread
--> set_debug_reg_defaults
--> set_breakpoint
--> __get_cpu_var
Since preemption is enabled in the cleanup of flush thread, and
there is no need to disable it, introduce the distinction between
set_breakpoint and __set_breakpoint, leaving only the flush_thread
instance as the current user of set_breakpoint.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of the callers check the return value, so it might as
well not have one at all.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Change how we setup registers for ret_from_kernel_thread. In
ABIv1, instead of passing a function descriptor in, dereference
it and pass the target in directly.
Use ppc_global_function_entry to get it right on both ABIv1 and ABIv2.
Signed-off-by: Anton Blanchard <anton@samba.org>
We can't take an IRQ when we're about to do a trechkpt as our GPR state is set
to user GPR values.
We've hit this when running some IBM Java stress tests in the lab resulting in
the following dump:
cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
pc: c000000000050074: restore_gprs+0xc0/0x148
lr: 00000000b52a8184
sp: ac57d360
msr: 8000000100201030
current = 0xc00000002c500000
paca = 0xc000000007dbfc00 softe: 0 irq_happened: 0x00
pid = 34535, comm = Pooled Thread #
R00 = 00000000b52a8184 R16 = 00000000b3e48fda
R01 = 00000000ac57d360 R17 = 00000000ade79bd8
R02 = 00000000ac586930 R18 = 000000000fac9bcc
R03 = 00000000ade60000 R19 = 00000000ac57f930
R04 = 00000000f6624918 R20 = 00000000ade79be8
R05 = 00000000f663f238 R21 = 00000000ac218a54
R06 = 0000000000000002 R22 = 000000000f956280
R07 = 0000000000000008 R23 = 000000000000007e
R08 = 000000000000000a R24 = 000000000000000c
R09 = 00000000b6e69160 R25 = 00000000b424cf00
R10 = 0000000000000181 R26 = 00000000f66256d4
R11 = 000000000f365ec0 R27 = 00000000b6fdcdd0
R12 = 00000000f66400f0 R28 = 0000000000000001
R13 = 00000000ada71900 R29 = 00000000ade5a300
R14 = 00000000ac2185a8 R30 = 00000000f663f238
R15 = 0000000000000004 R31 = 00000000f6624918
pc = c000000000050074 restore_gprs+0xc0/0x148
cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
lr = 00000000b52a8184
msr = 8000000100201030 cr = 24804888
ctr = 0000000000000000 xer = 0000000000000000 trap = 700
This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into
that function. It then adds IRQ disabling over the trechkpt critical section.
It also sets the TEXASR FS in the signals code to ensure this is never set now
that we explictly write the TM sprs in tm_recheckpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we fork/clone we currently don't copy any of the TM state to the new
thread. This results in a TM bad thing (program check) when the new process is
switched in as the kernel does a tmrechkpt with TEXASR FS not set. Also, since
R1 is from userspace, we trigger the bad kernel stack pointer detection. So we
end up with something like this:
Bad kernel stack pointer 0 at c0000000000404fc
cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
pc: c0000000000404fc: restore_gprs+0xc0/0x148
lr: 0000000000000000
sp: 0
msr: 9000000100201030
current = 0xc000001dd1417c30
paca = 0xc00000000fe00800 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/2
WARNING: exception is not recoverable, can't continue
The below fixes this by flushing the TM state before we copy the task_struct to
the clone. To do this we go through the tmreclaim patch, which removes the
checkpointed registers from the CPU and transitions the CPU out of TM suspend
mode. Hence we need to call tmrechkpt after to restore the checkpointed state
and the TM mode for the current task.
To make this fail from userspace is simply:
tbegin
li r0, 2
sc
<boom>
Kudos to Adhemerval Zanella Neto for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes a logic error that caused a failure to update the hw breakpoint
registers when not using the hw-breakpoint interface.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc updates from Ben Herrenschmidt:
"So here's my next branch for powerpc. A bit late as I was on vacation
last week. It's mostly the same stuff that was in next already, I
just added two patches today which are the wiring up of lockref for
powerpc, which for some reason fell through the cracks last time and
is trivial.
The highlights are, in addition to a bunch of bug fixes:
- Reworked Machine Check handling on kernels running without a
hypervisor (or acting as a hypervisor). Provides hooks to handle
some errors in real mode such as TLB errors, handle SLB errors,
etc...
- Support for retrieving memory error information from the service
processor on IBM servers running without a hypervisor and routing
them to the memory poison infrastructure.
- _PAGE_NUMA support on server processors
- 32-bit BookE relocatable kernel support
- FSL e6500 hardware tablewalk support
- A bunch of new/revived board support
- FSL e6500 deeper idle states and altivec powerdown support
You'll notice a generic mm change here, it has been acked by the
relevant authorities and is a pre-req for our _PAGE_NUMA support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
powerpc: Add support for the optimised lockref implementation
powerpc/powernv: Call OPAL sync before kexec'ing
powerpc/eeh: Escalate error on non-existing PE
powerpc/eeh: Handle multiple EEH errors
powerpc: Fix transactional FP/VMX/VSX unavailable handlers
powerpc: Don't corrupt transactional state when using FP/VMX in kernel
powerpc: Reclaim two unused thread_info flag bits
powerpc: Fix races with irq_work
Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
powerpc: Make add_system_ram_resources() __init
powerpc: add SATA_MV to ppc64_defconfig
powerpc/powernv: Increase candidate fw image size
powerpc: Add debug checks to catch invalid cpu-to-node mappings
powerpc: Fix the setup of CPU-to-Node mappings during CPU online
powerpc/iommu: Don't detach device without IOMMU group
powerpc/eeh: Hotplug improvement
powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
powerpc/eeh: Add restore_config operation
...
Currently, when we have a process using the transactional memory
facilities on POWER8 (that is, the processor is in transactional
or suspended state), and the process enters the kernel and the
kernel then uses the floating-point or vector (VMX/Altivec) facility,
we end up corrupting the user-visible FP/VMX/VSX state. This
happens, for example, if a page fault causes a copy-on-write
operation, because the copy_page function will use VMX to do the
copy on POWER8. The test program below demonstrates the bug.
The bug happens because when FP/VMX state for a transactional process
is stored in the thread_struct, we store the checkpointed state in
.fp_state/.vr_state and the transactional (current) state in
.transact_fp/.transact_vr. However, when the kernel wants to use
FP/VMX, it calls enable_kernel_fp() or enable_kernel_altivec(),
which saves the current state in .fp_state/.vr_state. Furthermore,
when we return to the user process we return with FP/VMX/VSX
disabled. The next time the process uses FP/VMX/VSX, we don't know
which set of state (the current register values, .fp_state/.vr_state,
or .transact_fp/.transact_vr) we should be using, since we have no
way to tell if we are still in the same transaction, and if not,
whether the previous transaction succeeded or failed.
Thus it is necessary to strictly adhere to the rule that if FP has
been enabled at any point in a transaction, we must keep FP enabled
for the user process with the current transactional state in the
FP registers, until we detect that it is no longer in a transaction.
Similarly for VMX; once enabled it must stay enabled until the
process is no longer transactional.
In order to keep this rule, we add a new thread_info flag which we
test when returning from the kernel to userspace, called TIF_RESTORE_TM.
This flag indicates that there is FP/VMX/VSX state to be restored
before entering userspace, and when it is set the .tm_orig_msr field
in the thread_struct indicates what state needs to be restored.
The restoration is done by restore_tm_state(). The TIF_RESTORE_TM
bit is set by new giveup_fpu/altivec_maybe_transactional helpers,
which are called from enable_kernel_fp/altivec, giveup_vsx, and
flush_fp/altivec_to_thread instead of giveup_fpu/altivec.
The other thing to be done is to get the transactional FP/VMX/VSX
state from .fp_state/.vr_state when doing reclaim, if that state
has been saved there by giveup_fpu/altivec_maybe_transactional.
Having done this, we set the FP/VMX bit in the thread's MSR after
reclaim to indicate that that part of the state is now valid
(having been reclaimed from the processor's checkpointed state).
Finally, in the signal handling code, we move the clearing of the
transactional state bits in the thread's MSR a bit earlier, before
calling flush_fp_to_thread(), so that we don't unnecessarily set
the TIF_RESTORE_TM bit.
This is the test program:
/* Michael Neuling 4/12/2013
*
* See if the altivec state is leaked out of an aborted transaction due to
* kernel vmx copy loops.
*
* gcc -m64 htm_vmxcopy.c -o htm_vmxcopy
*
*/
/* We don't use all of these, but for reference: */
int main(int argc, char *argv[])
{
long double vecin = 1.3;
long double vecout;
unsigned long pgsize = getpagesize();
int i;
int fd;
int size = pgsize*16;
char tmpfile[] = "/tmp/page_faultXXXXXX";
char buf[pgsize];
char *a;
uint64_t aborted = 0;
fd = mkstemp(tmpfile);
assert(fd >= 0);
memset(buf, 0, pgsize);
for (i = 0; i < size; i += pgsize)
assert(write(fd, buf, pgsize) == pgsize);
unlink(tmpfile);
a = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
assert(a != MAP_FAILED);
asm __volatile__(
"lxvd2x 40,0,%[vecinptr] ; " // set 40 to initial value
TBEGIN
"beq 3f ;"
TSUSPEND
"xxlxor 40,40,40 ; " // set 40 to 0
"std 5, 0(%[map]) ;" // cause kernel vmx copy page
TABORT
TRESUME
TEND
"li %[res], 0 ;"
"b 5f ;"
"3: ;" // Abort handler
"li %[res], 1 ;"
"5: ;"
"stxvd2x 40,0,%[vecoutptr] ; "
: [res]"=r"(aborted)
: [vecinptr]"r"(&vecin),
[vecoutptr]"r"(&vecout),
[map]"r"(a)
: "memory", "r0", "r3", "r4", "r5", "r6", "r7");
if (aborted && (vecin != vecout)){
printf("FAILED: vector state leaked on abort %f != %f\n",
(double)vecin, (double)vecout);
exit(1);
}
munmap(a, size);
close(fd);
printf("PASSED!\n");
return 0;
}
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
The one instance where we add an include for init.h covers off
a case where that file was implicitly getting it from another
header which itself didn't need it.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The e500 SPE floating-point emulation code clears existing exceptions
(__FPU_FPSCR &= ~FP_EX_MASK;) before ORing in the exceptions from the
emulated operation. However, these exception bits are the "sticky",
cumulative exception bits, and should only be cleared by the user
program setting SPEFSCR, not implicitly by any floating-point
instruction (whether executed purely by the hardware or emulated).
The spurious clearing of these bits shows up as missing exceptions in
glibc testing.
Fixing this, however, is not as simple as just not clearing the bits,
because while the bits may be from previous floating-point operations
(in which case they should not be cleared), the processor can also set
the sticky bits itself before the interrupt for an exception occurs,
and this can happen in cases when IEEE 754 semantics are that the
sticky bit should not be set. Specifically, the "invalid" sticky bit
is set in various cases with non-finite operands, where IEEE 754
semantics do not involve raising such an exception, and the
"underflow" sticky bit is set in cases of exact underflow, whereas
IEEE 754 semantics are that this flag is set only for inexact
underflow. Thus, for correct emulation the kernel needs to know the
setting of these two sticky bits before the instruction being
emulated.
When a floating-point operation raises an exception, the kernel can
note the state of the sticky bits immediately afterwards. Some
<fenv.h> functions that affect the state of these bits, such as
fesetenv and feholdexcept, need to use prctl with PR_GET_FPEXC and
PR_SET_FPEXC anyway, and so it is natural to record the state of those
bits during that call into the kernel and so avoid any need for a
separate call into the kernel to inform it of a change to those bits.
Thus, the interface I chose to use (in this patch and the glibc port)
is that one of those prctl calls must be made after any userspace
change to those sticky bits, other than through a floating-point
operation that traps into the kernel anyway. feclearexcept and
fesetexceptflag duly make those calls, which would not be required
were it not for this issue.
The previous EGLIBC port, and the uClibc code copied from it, is
fundamentally broken as regards any use of prctl for floating-point
exceptions because it didn't use the PR_FP_EXC_SW_ENABLE bit in its
prctl calls (and did various worse things, such as passing a pointer
when prctl expected an integer). If you avoid anything where prctl is
used, the clearing of sticky bits still means it will never give
anything approximating correct exception semantics with existing
kernels. I don't believe the patch makes things any worse for
existing code that doesn't try to inform the kernel of changes to
sticky bits - such code may get incorrect exceptions in some cases,
but it would have done so anyway in other cases.
Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Commit ce11e48b7f ("KVM: PPC: E500: Add
userspace debug stub support") added "struct thread_struct" to the
stack of kvmppc_vcpu_run(). thread_struct is 1152 bytes on my build,
compared to 48 bytes for the recently-introduced "struct debug_reg".
Use the latter instead.
This fixes the following error:
cc1: warnings being treated as errors
arch/powerpc/kvm/booke.c: In function 'kvmppc_vcpu_run':
arch/powerpc/kvm/booke.c:760:1: error: the frame size of 1424 bytes is larger than 1024 bytes
make[2]: *** [arch/powerpc/kvm/booke.o] Error 1
make[1]: *** [arch/powerpc/kvm] Error 2
make[1]: *** Waiting for unfinished jobs....
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Pull third set of powerpc updates from Benjamin Herrenschmidt:
"This is a small collection of random bug fixes and a few improvements
of Oops output which I deemed valuable enough to include as well.
The fixes are essentially recent build breakage and regressions, and a
couple of older bugs such as the DTL log duplication, the EEH issue
with PCI_COMMAND_MASTER and the problem with small contexts passed to
get/set_context with VSX enabled"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/signals: Mark VSX not saved with small contexts
powerpc/pseries: Fix SMP=n build of rng.c
powerpc: Make cpu_to_chip_id() available when SMP=n
powerpc/vio: Fix a dma_mask issue of vio
powerpc: booke: Fix build failures
powerpc: ppc64 address space capped at 32TB, mmap randomisation disabled
powerpc: Only print PACATMSCRATCH in oops when TM is active
powerpc/pseries: Duplicate dtl entries sometimes sent to userspace
powerpc: Remove a few lines of oops output
powerpc: Print DAR and DSISR on machine check oopses
powerpc: Fix __get_user_pages_fast() irq handling
powerpc/eeh: More accurate log
powerpc/eeh: Enable PCI_COMMAND_MASTER for PCI bridges
If TM is not active there is no need to print PACATMSCRATCH
so we can save ourselves a line.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Machine check exceptions set DAR and DSISR, so print them in our
oops output.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
No function descriptor, but we set r12 up and set TIF_RESTOREALL as it
normally isn't restored on return from syscall.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently turn IRQs off in __switch_to(0 but this is unnecessary as it's
already disabled in the caller.
This removes the IRQ disable but adds a check to make sure it is really off
in case this changes in future.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
KVM need this function when switching from vcpu to user-space
thread. My subsequent patch will use this function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This way we can use same data type struct with KVM and
also help in using other debug related function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[scottwood@freescale.com: removed obvious debug_reg comment]
Signed-off-by: Scott Wood <scottwood@freescale.com>
This provides a facility which is intended for use by KVM, where the
contents of the FP/VSX and VMX (Altivec) registers can be saved away
to somewhere other than the thread_struct when kernel code wants to
use floating point or VMX instructions. This is done by providing a
pointer in the thread_struct to indicate where the state should be
saved to. The giveup_fpu() and giveup_altivec() functions test these
pointers and save state to the indicated location if they are non-NULL.
Note that the MSR_FP/VEC bits in task->thread.regs->msr are still used
to indicate whether the CPU register state is live, even when an
alternate save location is being used.
This also provides load_fp_state() and load_vr_state() functions, which
load up FP/VSX and VMX state from memory into the CPU registers, and
corresponding store_fp_state() and store_vr_state() functions, which
store FP/VSX and VMX state into memory from the CPU registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct. In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well. Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.
This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc. This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros. This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've been keeping that field in thread_struct for a while, it contains
the "limit" of the current stack pointer and is meant to be used for
detecting stack overflows.
It has a few problems however:
- First, it was never actually *used* on 64-bit. Set and updated but
not actually exploited
- When switching stack to/from irq and softirq stacks, it's update
is racy unless we hard disable interrupts, which is costly. This
is fine on 32-bit as we don't soft-disable there but not on 64-bit.
Thus rather than fixing 2 in order to implement 1 in some hypothetical
future, let's remove the code completely from 64-bit. In order to avoid
a clutter of ifdef's, we remove the updates from C code completely
during interrupt stack switching, and instead maintain it from the
asm helper that is used to do the stack switching in the first place.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In the current kernel, the function flush_fp_to_thread() is not
dependent on CONFIG_PPC_FPU. So most invocations of this function
is not wrapped by CONFIG_PPC_FPU. Even through we don't really
save the FPRs to the thread struct if CONFIG_PPC_FPU is not enabled,
but there does have some runtime overhead such as the check for
tsk->thread.regs and preempt disable and enable. It really make
no sense to do that. So make it a nop when CONFIG_PPC_FPU is
disabled. Also remove the wrapped #ifdef CONFIG_PPC_FPU
when invoking this function.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves us to save the Target Address Register (TAR) a earlier in
__switch_to. It introduces a new function save_tar() to do this.
We need to save the TAR earlier as we will overwrite it in the transactional
memory reclaim/recheckpoint path. We are going to do this in a subsequent
patch which will fix saving the TAR register when it's modified inside a
transaction.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
=Ob6Z
-----END PGP SIGNATURE-----
Merge tag 'v3.10' into next
Merge 3.10 in order to get some of the last minute powerpc
changes, resolve conflicts and add additional fixes on top
of them.
Add support for EBB (Event Based Branches) on 64-bit book3s. See the
included documentation for more details.
EBBs are a feature which allows the hardware to branch directly to a
specified user space address when a PMU event overflows. This can be
used by programs for self-monitoring with no kernel involvement in the
inner loop.
Most of the logic is in the generic book3s code, primarily to avoid a
proliferation of PMU callbacks.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It's possible for us to crash when running with ftrace enabled, eg:
Bad kernel stack pointer bffffd12 at c00000000000a454
cpu 0x3: Vector: 300 (Data Access) at [c00000000ffe3d40]
pc: c00000000000a454: resume_kernel+0x34/0x60
lr: c00000000000335c: performance_monitor_common+0x15c/0x180
sp: bffffd12
msr: 8000000000001032
dar: bffffd12
dsisr: 42000000
If we look at current's stack (paca->__current->stack) we see it is
equal to c0000002ecab0000. Our stack is 16K, and comparing to
paca->kstack (c0000002ecab3e30) we can see that we have overflowed our
kernel stack. This leads to us writing over our struct thread_info, and
in this case we have corrupted thread_info->flags and set
_TIF_EMULATE_STACK_STORE.
Dumping the stack we see:
3:mon> t c0000002ecab0000
[c0000002ecab0000] c00000000002131c .performance_monitor_exception+0x5c/0x70
[c0000002ecab0080] c00000000000335c performance_monitor_common+0x15c/0x180
--- Exception: f01 (Performance Monitor) at c0000000000fb2ec .trace_hardirqs_off+0x1c/0x30
[c0000002ecab0370] c00000000016fdb0 .trace_graph_entry+0xb0/0x280 (unreliable)
[c0000002ecab0410] c00000000003d038 .prepare_ftrace_return+0x98/0x130
[c0000002ecab04b0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
[c0000002ecab0520] c0000000000d6b58 .idle_cpu+0x18/0x90
[c0000002ecab05a0] c00000000000a934 .return_to_handler+0x0/0x34
[c0000002ecab0620] c00000000001e660 .timer_interrupt+0x160/0x300
[c0000002ecab06d0] c0000000000025dc decrementer_common+0x15c/0x180
--- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
[c0000002ecab09c0] c0000000000fe044 .trace_hardirqs_on+0x14/0x30 (unreliable)
[c0000002ecab0fb0] c00000000016fe3c .trace_graph_entry+0x13c/0x280
[c0000002ecab1050] c00000000003d038 .prepare_ftrace_return+0x98/0x130
[c0000002ecab10f0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
[c0000002ecab1160] c0000000000161f0 .__ppc64_runlatch_on+0x10/0x40
[c0000002ecab11d0] c00000000000a934 .return_to_handler+0x0/0x34
--- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
... and so on
__ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry
path. At that point the irq state is not consistent, ie. interrupts are
hard disabled (by the exception entry), but the paca soft-enabled flag
may be out of sync.
This leads to the local_irq_restore() in trace_graph_entry() actually
enabling interrupts, which we do not want. Because we have not yet
reprogrammed the decrementer we immediately take another decrementer
exception, and recurse.
The fix is twofold. Firstly make sure we call DISABLE_INTS before
calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles
the irq state in the paca with the hardware, making it safe again to
call local_irq_save/restore().
Although that should be sufficient to fix the bug, we also mark the
runlatch routines as notrace. They are called very early in the
exception entry and we are asking for trouble tracing them. They are
also fairly uninteresting and tracing them just adds unnecessary
overhead.
[ This regression was introduced by fe1952fc0a
"powerpc: Rework runlatch code" by myself --BenH
]
CC: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When introducing support for DABRX in 4474ef0, we broke older 32-bit CPUs
that don't have that register.
Some CPUs have a DABR but not DABRX. Configuration are:
- No 32bit CPUs have DABRX but some have DABR.
- POWER4+ and below have the DABR but no DABRX.
- 970 and POWER5 and above have DABR and DABRX.
- POWER8 has DAWR, hence no DABRX.
This introduces CPU_FTR_DABRX and sets it on appropriate CPUs. We use
the top 64 bits for CPU FTR bits since only 64 bit CPUs have this.
Processors that don't have the DABRX will still work as they will fall
back to software filtering these breakpoints via perf_exclude_event().
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: "Gorelik, Jacob (335F)" <jacob.gorelik@jpl.nasa.gov>
cc: stable@vger.kernel.org (v3.9 only)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
MSR_DE is not cleared on entry to the kernel, and we don't clear it
explicitly outside of debug code. If we have MSR_DE set in
prime_debug_regs(), and the new thread has events enabled in DBCR0
(e.g. ICMP is set in thread->dbsr0, even though it was cleared in the
real DBCR0 when the thread got scheduled out), we'll end up taking a
debug exception in the kernel when DBCR0 is loaded. DSRR0 will not
point to an exception vector, and the kernel ends up hanging at
kernel_dbg_exc. Fix this by always clearing MSR_DE when we load new
debug state.
Another observed source of kernel_dbg_exc hangs is with the branch
taken event. If this event is active, but we take a non-debug trap
(e.g. a TLB miss or an asynchronous interrupt) before the next branch.
We end up taking a branch-taken debug exception on the initial branch
instruction of the exception vector, but because the debug exception is
DBSR_BT rather than DBSR_IC we branch to kernel_dbg_exc before even
checking the DSRR0 address. Fix this by checking for DBSR_BT as well
as DBSR_IC, which is what 32-bit does and what the comments suggest was
intended in the 64-bit code as well.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Saw this warning again, and this time from the ret_from_fork path.
It seems we could clear the back chain earlier in copy_thread(), which
could cover both path, and also fix potential lockdep usage in
schedule_tail(), or exception occurred before we clear the back chain.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
- A pile of addition POWER8 bits and nits, such as updated
performance counter support (Michael Ellerman), new branch history
buffer support (Anshuman Khandual), base support for the new PCI
host bridge when not using the hypervisor (Gavin Shan) and other
random related bits and fixes from various contributors.
- Some rework of our page table format by Aneesh Kumar which fixes a
thing or two and paves the way for THP support. THP itself will
not make it this time around however.
- More Freescale updates, including Altivec support on the new e6500
cores, new PCI controller support, and a pile of new boards support
and updates.
- The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
powerpc: Fix build error for book3e
powerpc: Context switch the new EBB SPRs
powerpc: Turn on the EBB H/FSCR bits
powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
powerpc: Setup BHRB instructions facility in HFSCR for POWER8
powerpc: Fix interrupt range check on debug exception
powerpc: Update tlbie/tlbiel as per ISA doc
powerpc: Print page size info during boot
powerpc: print both base and actual page size on hash failure
powerpc: Fix hpte_decode to use the correct decoding for page sizes
powerpc: Decode the pte-lp-encoding bits correctly.
powerpc: Use encode avpn where we need only avpn values
powerpc: Reduce PTE table memory wastage
powerpc: Move the pte free routines from common header
powerpc: Reduce the PTE_INDEX_SIZE
powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
powerpc: New hugepage directory format
powerpc: Don't truncate pgd_index wrongly
powerpc: Don't hard code the size of pte page
powerpc: Save DAR and DSISR in pt_regs on MCE
...
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Archs which didn't print debug info now do.
alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
um, xtensa
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping s390 specfic implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
An example BUG() dump follows.
kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>] [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8 EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
[<ffffffff81000312>] do_one_initcall+0x122/0x170
[<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
[<ffffffff81c47760>] ? rest_init+0x140/0x140
[<ffffffff81c4776e>] kernel_init+0xe/0xf0
[<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
[<ffffffff81c47760>] ? rest_init+0x140/0x140
...
v2: Typo fix in x86-32.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both dump_stack() and show_stack() are currently implemented by each
architecture. show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack(). On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.
The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().
There's no reason to require archs to implement two separate but mostly
identical functions. It leads to unnecessary subtle information.
This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin. Blackfin's
dump_stack() does something wonky that I don't understand.
Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information. This is used
in blackfin.
This patch brings the following behavior changes.
* On some archs, an extra level in backtrace for show_stack() could be
printed. This is because the top frame was determined in
dump_stack() on those archs while generic dump_stack() can't do that
reliably. It can be compensated by inlining dump_stack() but not
sure whether that'd be necessary.
* Most archs didn't use to print debug info on dump_stack(). They do
now.
An example WARN dump follows.
WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Hardware name: empty
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
[<ffffffff81c614dc>] dump_stack+0x19/0x1b
[<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8234a071>] init_workqueues+0x35/0x505
...
v2: CPU number added to the generic debug info as requested by s390
folks and dropped the s390 specific dump_stack(). This loses %ksp
from the debug message which the maintainers think isn't important
enough to keep the s390-specific dump_stack() implementation.
dump_stack_print_info() is moved to kernel/printk.c from
lib/dump_stack.c. Because linkage is per objecct file,
dump_stack_print_info() living in the same lib file as generic
dump_stack() means that archs which implement custom dump_stack()
- at this point, only blackfin - can't use dump_stack_print_info()
as that will bring in the generic version of dump_stack() too. v1
The v1 patch broke build on blackfin due to this issue. The build
breakage was reported by Fengguang Wu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390 bits]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch_dup_task_struct() does flush_ptrace_hw_breakpoint(src), this
destroys the parent's breakpoints for no reason. We should clear
child->thread.ptrace_bps[] copied by dup_task_struct() instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't compile a kernel with CONFIG_ALTIVEC=n when
CONFIG_PPC_TRANSACTIONAL_MEM=y. We currently get:
arch/powerpc/kernel/tm.S:320: Error: unsupported relocation against THREAD_VSCR
arch/powerpc/kernel/tm.S:323: Error: unsupported relocation against THREAD_VR0
arch/powerpc/kernel/tm.S:323: Error: unsupported relocation against THREAD_VR0
etc.
The below fixes this with a sprinkling of #ifdefs.
This was found by mpe with kisskb:
http://kisskb.ellerman.id.au/kisskb/buildresult/8539442/
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
This hooks the new transactional memory code into context switching, FP/VMX/VMX
unavailable and exception return.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we switch out a task, we need to save both the checkpointed and the
speculated state into the thread struct.
Similarly when we are switching in a task we need to load both the checkpointed
and speculated state. If the task was using FP, we non-lazily reload both the
original and the speculative FP register states. This is because the kernel
doesn't see if/when a TM rollback occurs, so if we take an FP unavoidable
later, we are unable to determine which set of FP regs need to be restored.
This simply adds these functions. It doesn't hook them into the existing code
yet.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add transactional memory paca scratch register to show_regs. This is useful
for debugging.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds new macros for saving and restoring checkpointed architected state
from and to the thread_struct.
It also adds some debugging macros for when your brain explodes trying to debug
your transactional memory enabled kernel.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we set the length field in the DAWR to 0 which defaults it to one
double word (64bits) which is the same as the DABR.
Change this so that we can set it to longer values as supported by the DAWR.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With allmodconfig we are getting:
drivers/tty/synclink_gt.c:160:12: error: conflicting types for 'set_break'
arch/powerpc/include/asm/debug.h:49:5: note: previous declaration of 'set_break' was here
drivers/tty/synclinkmp.c:526:12: error: conflicting types for 'set_break'
arch/powerpc/include/asm/debug.h:49:5: note: previous declaration of 'set_break' was here
This renames set_break to set_breakpoint to avoid this naming conflict
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds DAWR supoprt to the set_break().
It does both bare metal and PAPR versions of setting the DAWR.
There is still some work we can do to make full use of the watchpoint but that
will come later.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is a rewrite so that we don't assume we are using the DABR throughout the
code. We now use the arch_hw_breakpoint to store the breakpoint in a generic
manner in the thread_struct, rather than storing the raw DABR value.
The ptrace GET/SET_DEBUGREG interface currently passes the raw DABR in from
userspace. We keep this functionality, so that future changes (like the POWER8
DAWR), will still fake the DABR to userspace.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[PATCH 4/6] powerpc: Define ppr in thread_struct
ppr in thread_struct is used to save PPR and restore it before process exits
from kernel.
This patch sets the default priority to 3 when tasks are created such
that users can use 4 for higher priority tasks.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
kernel_thread() callbacks are *not* in modules and are not going to
be there. And it's not even read in ppc32 ret_from_kernel_thread(),
so no need to bother with it there either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull pile 2 of execve and kernel_thread unification work from Al Viro:
"Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
several more architectures plus assorted signal fixes and cleanups.
There'll be more (in particular, real fixes for the alpha
do_notify_resume() irq mess)..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
alpha: don't open-code trace_report_syscall_{enter,exit}
Uninclude linux/freezer.h
m32r: trim masks
avr32: trim masks
tile: don't bother with SIGTRAP in setup_frame
microblaze: don't bother with SIGTRAP in setup_rt_frame()
mn10300: don't bother with SIGTRAP in setup_frame()
frv: no need to raise SIGTRAP in setup_frame()
x86: get rid of duplicate code in case of CONFIG_VM86
unicore32: remove pointless test
h8300: trim _TIF_WORK_MASK
parisc: decide whether to go to slow path (tracesys) based on thread flags
parisc: don't bother looping in do_signal()
parisc: fix double restarts
bury the rest of TIF_IRET
sanitize tsk_is_polling()
bury _TIF_RESTORE_SIGMASK
unicore32: unobfuscate _TIF_WORK_MASK
mips: NOTIFY_RESUME is not needed in TIF masks
mips: merge the identical "return from syscall" per-ABI code
...
Conflicts:
arch/arm/include/asm/thread_info.h
Pull powerpc updates from Benjamin Herrenschmidt:
"Some highlights in addition to the usual batch of fixes:
- 64TB address space support for 64-bit processes by Aneesh Kumar
- Gavin Shan did a major cleanup & re-organization of our EEH support
code (IBM fancy PCI error handling & recovery infrastructure) which
paves the way for supporting different platform backends, along
with some rework of the PCIe code for the PowerNV platform in order
to remove home made resource allocations and instead use the
generic code (which is possible after some small improvements to it
done by Gavin).
- Uprobes support by Ananth N Mavinakayanahalli
- A pile of embedded updates from Freescale folks, including new SoC
and board supports, more KVM stuff including preparing for 64-bit
BookE KVM support, ePAPR 1.1 updates, etc..."
Fixup trivial conflicts in drivers/scsi/ipr.c
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
powerpc/iommu: Fix multiple issues with IOMMU pools code
powerpc: Fix VMX fix for memcpy case
driver/mtd:IFC NAND:Initialise internal SRAM before any write
powerpc/fsl-pci: use 'Header Type' to identify PCIE mode
powerpc/eeh: Don't release eeh_mutex in eeh_phb_pe_get
powerpc: Remove tlb batching hack for nighthawk
powerpc: Set paca->data_offset = 0 for boot cpu
powerpc/perf: Sample only if SIAR-Valid bit is set in P7+
powerpc/fsl-pci: fix warning when CONFIG_SWIOTLB is disabled
powerpc/mpc85xx: Update interrupt handling for IFC controller
powerpc/85xx: Enable USB support in p1023rds_defconfig
powerpc/smp: Do not disable IPI interrupts during suspend
powerpc/eeh: Fix crash on converting OF node to edev
powerpc/eeh: Lock module while handling EEH event
powerpc/kprobe: Don't emulate store when kprobe stwu r1
powerpc/kprobe: Complete kprobe and migrate exception frame
powerpc/kprobe: Introduce a new thread flag
powerpc: Remove unused __get_user64() and __put_user64()
powerpc/eeh: Global mutex to protect PE tree
powerpc/eeh: Remove EEH PE for normal PCI hotplug
...
Pull scheduler changes from Ingo Molnar:
"Continued quest to clean up and enhance the cputime code by Frederic
Weisbecker, in preparation for future tickless kernel features.
Other than that, smallish changes."
Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
cputime: Make finegrained irqtime accounting generally available
cputime: Gather time/stats accounting config options into a single menu
ia64: Reuse system and user vtime accounting functions on task switch
ia64: Consolidate user vtime accounting
vtime: Consolidate system/idle context detection
cputime: Use a proper subsystem naming for vtime related APIs
sched: cpu_power: enable ARCH_POWER
sched/nohz: Clean up select_nohz_load_balancer()
sched: Fix load avg vs. cpu-hotplug
sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
sched: Fix nohz_idle_balance()
sched: Remove useless code in yield_to()
sched: Add time unit suffix to sched sysctl knobs
sched/debug: Limit sd->*_idx range on sysctl
sched: Remove AFFINE_WAKEUPS feature flag
s390: Remove leftover account_tick_vtime() header
cputime: Consolidate vtime handling on context switch
sched: Move cputime code to its own file
cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
tile: Remove SD_PREFER_LOCAL leftover
...
the only non-obvious part is that current_pt_regs() is really needed
here - task_pt_regs() is NULL for kernel threads; it's OK for ptrace
uses (the thing task_pt_regs() is intended for), but not for us.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rework set_dabr to take a DABRX value as well.
Both the pseries and PS3 hypervisors do some checks on the DABRX
values that are passed in the hcall. This patch stops bogus values
from being passed to hypervisor. Also, in the case where we are
clearing the breakpoint, where DABR and DABRX are zero, we modify the
DABRX value to make it valid so that the hcall won't fail.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If the default DSCR is non zero we set thread.dscr_inherit in
copy_thread() meaning the new thread and all its children will ignore
future updates to the default DSCR. This is not intended and is
a change in behaviour that a number of our users have hit.
We just need to inherit thread.dscr and thread.dscr_inherit from
the parent which ends up being much simpler.
This was found with the following test case:
http://ozlabs.org/~anton/junkcode/dscr_default_test.c
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org> # 3.0+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add thread_struct.trap_nr and use it to store the last exception
the thread experienced. In this patch, we populate the field at
various places where we force_sig_info() to the process.
This is also used in uprobes to determine if the probed instruction
caused an exception.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some ground initialization for the
next task to account its cputime.
These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.
Consolidate this by creating a new account_switch_vtime()
callback called in generic code right after a context switch
and that these archs must implement to flush the prev task
cputime and initialize the next task cputime related state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Pull fpu state cleanups from Ingo Molnar:
"This tree streamlines further aspects of FPU handling by eliminating
the prepare_to_copy() complication and moving that logic to
arch_dup_task_struct().
It also fixes the FPU dumps in threaded core dumps, removes and old
(and now invalid) assumption plus micro-optimizes the exit path by
avoiding an FPU save for dead tasks."
Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
in because we now do the FPU handling in arch_dup_task_struct() rather
than the legacy (and now gone) prepare_to_copy().
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: drop the fpu state during thread exit
x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
coredump: ensure the fpu state is flushed for proper multi-threaded core dump
fork: move the real prepare_to_copy() users to arch_dup_task_struct()
Pull powerpc updates from Benjamin Herrenschmidt:
"Here are the powerpc goodies for 3.5. Main highlights are:
- Support for the NX crypto engine in Power7+
- A bunch of Anton goodness, including some micro optimization of our
syscall entry on Power7
- I converted a pile of our thermal control drivers to the new i2c
APIs (essentially turning the old therm_pm72 into a proper set of
windfarm drivers). That's one more step toward removing the
deprecated i2c APIs, there's still a few drivers to fix, but we are
getting close
- kexec/kdump support for 47x embedded cores
The big missing thing here is no updates from Freescale. Not sure
what's up here, but with Kumar not working for them anymore things are
a bit in a state of flux in that area."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (71 commits)
powerpc: Fix irq distribution
Revert "powerpc/hw-breakpoint: Use generic hw-breakpoint interfaces for new PPC ptrace flags"
powerpc: Fixing a cputhread code documentation
powerpc/crypto: Enable the PFO-based encryption device
powerpc/crypto: Build files for the nx device driver
powerpc/crypto: debugfs routines and docs for the nx device driver
powerpc/crypto: SHA512 hash routines for nx encryption
powerpc/crypto: SHA256 hash routines for nx encryption
powerpc/crypto: AES-XCBC mode routines for nx encryption
powerpc/crypto: AES-GCM mode routines for nx encryption
powerpc/crypto: AES-ECB mode routines for nx encryption
powerpc/crypto: AES-CTR mode routines for nx encryption
powerpc/crypto: AES-CCM mode routines for nx encryption
powerpc/crypto: AES-CBC mode routines for nx encryption
powerpc/crypto: nx driver code supporting nx encryption
powerpc/pseries: Enable the PFO-based RNG accelerator
powerpc/pseries/hwrng: PFO-based hwrng driver
powerpc/pseries: Add PFO support to the VIO bus
powerpc/pseries: Add pseries update notifier for OFDT prop changes
powerpc/pseries: Add new hvcall constants to support PFO
...
Historical prepare_to_copy() is mostly a no-op, duplicated for majority of
the architectures and the rest following the x86 model of flushing the extended
register state like fpu there.
Remove it and use the arch_dup_task_struct() instead.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The core now has a threadinfo allocator which uses a kmemcache when
THREAD_SIZE < PAGE_SIZE.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20120505150142.059161130@linutronix.de
Add two optimisations to enable_kernel_altivec:
- enable_kernel_altivec has already determined if we need to
save the previous task's state but we call giveup_altivec
in both cases, requiring an extra branch in giveup_altivec. Create
giveup_altivec_notask which only turns on the VMX bit in the
MSR.
- We write the VMX MSR bit each time we call enable_kernel_altivec
even it was already set. Check the bit and branch out if we have
already set it. The classic case for this is vectored IO
where we have to copy multiple buffers to or from userspace.
The following testcase was used to confirm this patch improves
performance:
http://ozlabs.org/~anton/junkcode/copy_to_user.c
Since the current breakpoint for using VMX in copy_tofrom_user is
4096 bytes, I'm using buffers of 4096 + 1 cacheline (4224) bytes.
A benchmark of 16 entry readvs (-s 16):
time copy_to_user -l 4224 -s 16 -i 1000000
completes 5.2% faster on a POWER7 PS700.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit fe1952fc0a
"powerpc: Rework runlatch code" has a nasty typo
where it uses "TLF_RUNLATCH" instead of "_TLF_RUNLATCH"
(bit number instead of bit mask), causing some flags to
be potentially lost such as _TLF_RESTORE_SIGMASK
(Brown paper bag for me ! We should be able to make
that break at compile time with a bit of magic, any
volunteer ?)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Disintegrate asm/system.h for PowerPC.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: linuxppc-dev@lists.ozlabs.org
The current implementation of lazy interrupts handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
This moves the inlines into system.h and changes the runlatch
code to use the thread local flags (non-atomic) rather than
the TIF flags (atomic) to keep track of the latch state.
The code to turn it back on in an asynchronous interrupt is
now simplified and partially inlined.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A kernel oops/panic prints an instruction dump showing several
instructions before and after the instruction which caused the
oops/panic.
The code intended that the faulting instruction be enclosed in angle
brackets, however a bug caused the faulting instruction to be
interpreted by printk() as the message log level.
To fix this, the KERN_CONT log level is added before the actual text of
the printed message.
=== Before the patch ===
[ 1081.587266] Instruction dump:
[ 1081.590236] 7c000110 7c0000f8 5400077c 552907f6 7d290378 992b0003 4e800020 38000001
[ 1081.598034] 3d20c03a 9009a114 7c0004ac 39200000
[ 1081.602500] 4e800020 3803ffd0 2b800009
<4>[ 1081.587266] Instruction dump:
<4>[ 1081.590236] 7c000110 7c0000f8 5400077c 552907f6 7d290378 992b0003 4e800020 38000001
<4>[ 1081.598034] 3d20c03a 9009a114 7c0004ac 39200000
<98090000>[ 1081.602500] 4e800020 3803ffd0 2b800009
=== After the patch ===
[ 51.385216] Instruction dump:
[ 51.388186] 7c000110 7c0000f8 5400077c 552907f6 7d290378 992b0003 4e800020 38000001
[ 51.395986] 3d20c03a 9009a114 7c0004ac 39200000 <98090000> 4e800020 3803ffd0 2b800009
<4>[ 51.385216] Instruction dump:
<4>[ 51.388186] 7c000110 7c0000f8 5400077c 552907f6 7d290378 992b0003 4e800020 38000001
<4>[ 51.395986] 3d20c03a 9009a114 7c0004ac 39200000 <98090000> 4e800020 3803ffd0 2b800009
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On a 64bit book3s machine I have an oops from a system reset that
claims the book3e CE bit was set:
MSR: 8000000000021032 <ME,CE,IR,DR> CR: 24004082 XER: 00000010
On a book3s machine system reset sets IBM bit 46 and 47 depending on
the power saving mode. Separate the definitions by type and for
completeness add the rest of the bits in.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With the introduction of CONFIG_PPC_ADV_DEBUG_REGS user space debug is
broken on Book-E 64-bit parts that support delayed debug events. When
switch_booke_debug_regs() sets DBCR0 we'll start getting debug events as
MSR_DE is also set and we aren't able to handle debug events from kernel
space.
We can remove the hack that always enables MSR_DE and loads up DBCR0 and
just utilize switch_booke_debug_regs() to get user space debug working
again.
We still need to handle critical/debug exception stacks & proper
save/restore of state for those exception levles to support debug events
from kernel space like we have on 32-bit.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We had an existing ifdef for 4xx & BOOKE processors that got changed to
CONFIG_PPC_ADV_DEBUG_REGS. The define has nothing to do with
CONFIG_PPC_ADV_DEBUG_REGS. The define really should be:
#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
and not
#ifdef CONFIG_PPC_ADV_DEBUG_REGS
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (99 commits)
drivers/virt: add missing linux/interrupt.h to fsl_hypervisor.c
powerpc/85xx: fix mpic configuration in CAMP mode
powerpc: Copy back TIF flags on return from softirq stack
powerpc/64: Make server perfmon only built on ppc64 server devices
powerpc/pseries: Fix hvc_vio.c build due to recent changes
powerpc: Exporting boot_cpuid_phys
powerpc: Add CFAR to oops output
hvc_console: Add kdb support
powerpc/pseries: Fix hvterm_raw_get_chars to accept < 16 chars, fixing xmon
powerpc/irq: Quieten irq mapping printks
powerpc: Enable lockup and hung task detectors in pseries and ppc64 defeconfigs
powerpc: Add mpt2sas driver to pseries and ppc64 defconfig
powerpc: Disable IRQs off tracer in ppc64 defconfig
powerpc: Sync pseries and ppc64 defconfigs
powerpc/pseries/hvconsole: Fix dropped console output
hvc_console: Improve tty/console put_chars handling
powerpc/kdump: Fix timeout in crash_kexec_wait_realmode
powerpc/mm: Fix output of total_ram.
powerpc/cpufreq: Add cpufreq driver for Momentum Maple boards
powerpc: Correct annotations of pmu registration functions
...
Fix up trivial Kconfig/Makefile conflicts in arch/powerpc, drivers, and
drivers/cpufreq
Now we have the CFAR saved add it to the oops output.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
giveup_spe() saves the SPE state which is protected by MSR[SPE].
However, modifying SPEFSCR does not trap when MSR[SPE]=0.
And since SPEFSCR is already saved/restored in _switch(),
not all the callers want to save SPEFSCR again.
Thus, saving SPEFSCR should not belong to giveup_spe().
This patch moves SPEFSCR saving to flush_spe_to_thread(),
and cleans up the caller that needs to save SPEFSCR accordingly.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Fix up powerpc to the new mmu_gather stuff.
PPC has an extra batching queue to RCU free the actual pagetable
allocations, use the ARCH extentions for that for now.
For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush on context switch
and use a TLF bit to track the lazy_mmu state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the 64bit PPC CPU features are MMU-related, so this patch moves
them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to
mmu_has_feature(), and seven feature bits are freed as a result.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The DSCR (aka Data Stream Control Register) is supported on some
server PowerPC chips and allow some control over the prefetch
of data streams.
This patch allows the value to be specified per thread by emulating
the corresponding mfspr and mtspr instructions. Children of such
threads inherit the value. Other threads use a default value that
can be specified in sysfs - /sys/devices/system/cpu/dscr_default.
If a thread starts with non default value in the sysfs entry,
all children threads inherit this non default value even if
the sysfs value is changed later.
Signed-off-by: Alexey Kardashevskiy <aik@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add a node parameter to alloc_thread_info(), and change its name to
alloc_thread_info_node()
This change is needed to allow NUMA aware kthread_create_on_cpu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the error in spelling the config option for hw-breakpoints and fix
the build issue that follows.
Signed-off by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We were printing 64 bits of DSISR in show_regs even though it is 32 bit.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times. This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish. The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.
This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in. Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.
On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log. We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called). So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.
On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval. This avoids having to
read the SPURR on every kernel entry and exit. On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.
This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Simple cleanup by moving arch_sd_sibling_asym_packing from process.c to
smp.c to save an #ifdef CONFIG_SMP
No functionality change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I'm sick of seeing ppc64_runlatch_off in our profiles, so inline it
into the callers. To avoid a mess of circular includes I didn't add
it as an inline function.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use is_32bit_task() helper to test 32 bit binary.
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make do_execve() take a const filename pointer so that kernel_execve() compiles
correctly on ARM:
arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type
This also requires the argv and envp arguments to be consted twice, once for
the pointer array and once for the strings the array points to. This is
because do_execve() passes a pointer to the filename (now const) to
copy_strings_kernel(). A simpler alternative would be to cast the filename
pointer in do_execve() when it's passed to copy_strings_kernel().
do_execve() may not change any of the strings it is passed as part of the argv
or envp lists as they are some of them in .rodata, so marking these strings as
const should be fine.
Further kernel_execve() and sys_execve() need to be changed to match.
This has been test built on x86_64, frv, arm and mips.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark arguments to certain system calls as being const where they should be but
aren't. The list includes:
(*) The filename arguments of various stat syscalls, execve(), various utimes
syscalls and some mount syscalls.
(*) The filename arguments of some syscall helpers relating to the above.
(*) The buffer argument of various write syscalls.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
sched: No need for bootmem special cases
sched: Revert nohz_ratelimit() for now
sched: Reduce update_group_power() calls
sched: Update rq->clock for nohz balanced cpus
sched: Fix spelling of sibling
sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
sched: thread_group_cputime: Simplify, document the "alive" check
sched: Remove the obsolete exit_state/signal hacks
sched: task_tick_rt: Remove the obsolete ->signal != NULL check
sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
sched: Fix comments to make them DocBook happy
sched: Fix fix_small_capacity
powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
powerpc: Enable asymmetric SMT scheduling on POWER7
sched: Add asymmetric group packing option for sibling domain
sched: Fix capacity calculations for SMT4
sched: Change nohz idle load balancing logic to push model
...
Our handling of debug interrupts on Book3E 64-bit is not quite
the way it should be just yet. This is a workaround to let gdb
work at least for now. We ensure that when context switching,
we set the appropriate DBCR0 value for the new task. We also
make sure that we turn off MSR[DE] within the kernel, and set
it as part of the bits that get set when going back to userspace.
In the long run, we will probably set the userspace DBCR0 on the
exception exit code path and ensure we have some proper kernel
value to set on the way into the kernel, a bit like ppc32 does,
but that will take more work.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
No logic changes, only spelling.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: linuxppc-dev@ozlabs.org
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <15249.1277776921@neuling.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement perf-events based hw-breakpoint interfaces for PowerPC
64-bit server (Book III S) processors. This allows access to a
given location to be used as an event that can be counted or
profiled by the perf_events subsystem.
This is done using the DABR (data breakpoint register), which can
also be used for process debugging via ptrace. When perf_event
hw_breakpoint support is configured in, the perf_event subsystem
manages the DABR and arbitrates access to it, and ptrace then
creates a perf_event when it is requested to set a data breakpoint.
[Adopted suggestions from Paul Mackerras <paulus@samba.org> to
- emulate_step() all system-wide breakpoints and single-step only the
per-task breakpoints
- perform arch-specific cleanup before unregistration through
arch_unregister_hw_breakpoint()
]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Irq stacks provide an essential protection from stack overflows through
external interrupts, at the cost of two additionals stacks per CPU.
Enable them unconditionally to simplify the kernel build and prevent
people from accidentally disabling them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Only SMP systems care about load-balance features, plus this
saves some .text space on UP and also fixes the build.
Reported-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Neuling <mikey@neuling.org>
LKML-Reference: <tip-76cbd8a8f8b0dddbff89a6708bd5bd13c0d21a00@git.kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The POWER7 core has dynamic SMT mode switching which is controlled by
the hypervisor. There are 3 SMT modes:
SMT1 uses thread 0
SMT2 uses threads 0 & 1
SMT4 uses threads 0, 1, 2 & 3
When in any particular SMT mode, all threads have the same performance
as each other (ie. at any moment in time, all threads perform the same).
The SMT mode switching works such that when linux has threads 2 & 3 idle
and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the
idle loop and the hypervisor will automatically switch to SMT2 for that
core (independent of other cores). The opposite is not true, so if
threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.
Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
into SMT1 mode.
If we can get the core into a lower SMT mode (SMT1 is best), the threads
will perform better (since they share less core resources). Hence when
we have idle threads, we want them to be the higher ones.
This adds a feature bit for asymmetric packing to powerpc and then
enables it on POWER7.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
LKML-Reference: <20100608045702.31FB5CC8C7@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
476 requires an isync after loading MMU and debug related SPR's. Some of
these are in performance-critical paths and may need to be optimized, but
initially, we're playing it safe.
Signed-off-by: Torez Smith <lnxtorez@linux.vnet.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
powerpc/booke: Add support for advanced debug registers
From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Based on patches originally written by Torez Smith.
This patch defines context switch and trap related functionality
for BookE specific Debug Registers. It adds support to ptrace()
for setting and getting BookE related Debug Registers
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Torez Smith <lnxtorez@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <dwg@au1.ibm.com>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc/booke: Introduce new CONFIG options for advanced debug registers
From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Introduce new config options to simplify the ifdefs pertaining to the
advanced debug registers for booke and 40x processors:
CONFIG_PPC_ADV_DEBUG_REGS - boolean: true for dac-based processors
CONFIG_PPC_ADV_DEBUG_IACS - number of IAC registers
CONFIG_PPC_ADV_DEBUG_DACS - number of DAC registers
CONFIG_PPC_ADV_DEBUG_DVCS - number of DVC registers
CONFIG_PPC_ADV_DEBUG_DAC_RANGE - DAC ranges supported
Beginning conservatively, since I only have the facilities to test 440
hardware. I believe all 40x and booke platforms support at least 2 IAC
and 2 DAC registers. For 440, 4 IAC and 2 DVC registers are enabled, as
well as the DAC ranges.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Here are the powerpc bits to remove TIF_ABI_PENDING now that
set_personality() is called at the appropriate place in exec.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix the following 3 issues:
arch/powerpc/kernel/process.c: In function 'arch_randomize_brk':
arch/powerpc/kernel/process.c:1183: error: 'mmu_highuser_ssize' undeclared (first use in this function)
arch/powerpc/kernel/process.c:1183: error: (Each undeclared identifier is reported only once
arch/powerpc/kernel/process.c:1183: error: for each function it appears in.)
arch/powerpc/kernel/process.c:1183: error: 'MMU_SEGSIZE_1T' undeclared (first use in this function)
In file included from arch/powerpc/kernel/setup_64.c:60:
arch/powerpc/include/asm/mmu-hash64.h:132: error: redefinition of 'struct mmu_psize_def'
arch/powerpc/include/asm/mmu-hash64.h:159: error: expected identifier or '(' before numeric constant
arch/powerpc/include/asm/mmu-hash64.h:396: error: conflicting types for 'mm_context_t'
arch/powerpc/include/asm/mmu-book3e.h:184: error: previous declaration of 'mm_context_t' was here
cc1: warnings being treated as errors
arch/powerpc/kernel/pci_64.c: In function 'pcibios_unmap_io_space':
arch/powerpc/kernel/pci_64.c💯 error: unused variable 'res'
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When the function graph tracer is enabled, it replaces the return address
with a hook back to the tracer. This makes back traces see the hook instead
of the actual return address.
The current code also shows the real address by checking if the return
address jumps to the return_to_handler. If it is, is also prints out
the saved real return address.
On powerpc64, some modules may return to mod_return_to_handler, which
is not checked. This patch will also show the real address if a return
is to mod_return_to_handler as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we are using 1TB segments and we are allowed to randomise the heap, we can
put it above 1TB so it is backed by a 1TB segment. Otherwise the heap will be
in the bottom 1TB which always uses 256MB segments and this may result in a
performance penalty.
This functionality is disabled when heap randomisation is turned off:
echo 1 > /proc/sys/kernel/randomize_va_space
which may be useful when trying to allocate the maximum amount of 16M or 16G
pages.
On a microbenchmark that repeatedly touches 32GB of memory with a stride of
256MB + 4kB (designed to stress 256MB segments while still mapping nicely into
the L1 cache), we see the improvement:
Force malloc to use heap all the time:
# export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1
Disable heap randomization:
# echo 1 > /proc/sys/kernel/randomize_va_space
# time ./test
12.51s
Enable heap randomization:
# echo 2 > /proc/sys/kernel/randomize_va_space
# time ./test
1.70s
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, a single ifdef covers SLB related bits and more generic ppc64
related bits, split this in two separate ifdef's since 64-bit BookE will
need one but not the other.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For some reason we've had an explicit KERN_INFO for GPR dumps. With
recent changes we get output like:
<6>GPR00: 00000000 ef855eb0 ef858000 00000001 000000d0 f1000000 ffbc8000 ffffffff
The KERN_INFO is causing the <6>. Don't see any reason to keep it
around.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is a random collection of added ifdef's around portions of
code that only mak sense on server processors. Using either
CONFIG_PPC_STD_MMU_64 or CONFIG_PPC_BOOK3S as seems appropriate.
This is meant to make the future merging of Book3E 64-bit support
easier.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Randomise ELF_ET_DYN_BASE, which is used when loading position independent
executables.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Randomise the lower bits of the stack address. More randomisation is good for
security but the scatter can also help with SMT threads that share an L1. A
quick test case shows this working:
int main()
{
int sp;
printf("%x\n", (unsigned long)&sp & 4095);
}
before:
80
80
80
80
80
after:
610
490
300
6b0
d80
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is a port of the function graph tracer that was written by
Frederic Weisbecker for the x86.
This only works for PPC64 at the moment and only for static tracing.
PPC32 and dynamic function graph tracing support will come later.
The trace produces a visual calling of functions:
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) 2.224 us | }
0) ! 271.024 us | }
0) ! 320.080 us | }
0) ! 324.656 us | }
0) ! 329.136 us | }
0) | .put_prev_task_fair() {
0) | .update_curr() {
0) 2.240 us | .update_min_vruntime();
0) 6.512 us | }
0) 2.528 us | .__enqueue_entity();
0) + 15.536 us | }
0) | .pick_next_task_fair() {
0) 2.032 us | .__pick_next_entity();
0) 2.064 us | .__clear_buddies();
0) | .set_next_entity() {
0) 2.672 us | .__dequeue_entity();
0) 6.864 us | }
Geoff Lavand tested on PS3.
Tested-by: Geoff Levand <geoffrey.levand@am.sony.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_idle_time and the second with account_system_time.
The architectures that use the account_xxx_time interface directly and not
the account_xxx_ticks interface now need to do the check for the idle
process in their arch code. In particular to improve the system vs true
idle time accounting the arch code needs to measure the true idle time
instead of just testing for the idle process.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.
In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.
This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On my screen, when something crashes, I only have space for maybe 16
functions of the stack trace before the information above it scrolls
off the screen. It's easy to hack the kernel to print out only that
much, but it's harder to remember to do it. This introduces a config
option for it so that I can keep the setting in my config.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Now that arch/ppc is gone and CONFIG_PPC_MERGE is always set, remove
the dead code associated with !CONFIG_PPC_MERGE from arch/powerpc
and include/asm-powerpc.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* CONFIG_BOOKE is selected by CONFIG_44x so we dont need both
* Fixed a few comments
* Go back to only using DBCR0_IDM to determine if we are using
debug resources.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch implements support for HW based watchpoint via the
DBSR_DAC (Data Address Compare) facility of the BookE processors.
It does so by interfacing with the existing DABR breakpoint code
and adding the necessary bits and pieces for the new bits to
be properly set or cleared
Signed-off-by: Luis Machado <luisgpm@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
giveup_vsx didn't save the FPU and VMX regsiters. Change it to be
like giveup_fpr/altivec which save these registers.
Also update call sites where FPU and VMX are already saved to use the
original giveup_vsx (renamed to __giveup_vsx).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This changes the oops and backtrace code to use the new %pS
printk extension to print out symbols rather than manually
calling print_symbol.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since Roland's ptrace cleanup starting with commit
f65255e8d5 ("[POWERPC] Use user_regset
accessors for FP regs"), the dump_task_* functions are no longer being
used.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This correctly hooks the VSX dump into Roland McGrath core file
infrastructure. It adds the VSX dump information as an additional elf
note in the core file (after talking more to the tool chain/gdb guys).
This also ensures the formats are consistent between signals, ptrace
and core files.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch extends the floating point save and restore code to use the
VSX load/stores when VSX is available. This will make FP context
save/restore marginally slower on FP only code, when VSX is available,
as it has to load/store 128bits rather than just 64bits.
Mixing FP, VMX and VSX code will get constant architected state.
The signals interface is extended to enable access to VSR 0-31
doubleword 1 after discussions with tool chain maintainers. Backward
compatibility is maintained.
The ptrace interface is also extended to allow access to VSR 0-31 full
registers.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We are going to change where the floating point registers are stored
in the thread_struct, so in preparation add some macros to access the
floating point registers. Update all code to use these new macros.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This makes it possible to use separate stacks for hard and soft IRQs
on 32-bit powerpc as well as on 64-bit. The code for 32-bit is just
the 32-bit analog of the 64-bit code.
* Added allocation and initialization of the irq stacks. We limit the
stacks to be in lowmem for ppc32.
* Implemented ppc32 versions of call_do_softirq() and call_handle_irq()
to switch the stack pointers
* Reworked how we do stack overflow detection. We now keep around the
limit of the stack in the thread_struct and compare against the limit
to see if we've overflowed. We can now use this on ppc64 if desired.
[ paulus@samba.org: Fixed bug on 6xx where we need to reload r9 with the
thread_info pointer. ]
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc kernel stacks need to be naturally aligned, as they
contain the thread info at the bottom, which is obtained by
clearing the low bits of the stack pointer.
However, when using 64K pages, the stack is smaller than a page,
so we use kmalloc to allocate it, but that doesn't provide the
alignment guarantee we need.
It appeared to work so far... until one enables SLUB debugging
which then returns unaligned pointers. Ooops...
This fixes it by using a slab cache with enforced alignment. It
relies on my previous patch that adds a thread_info_cache_init()
callback.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This moves various definitions used all over the place to parse stack
frames to ptrace.h so only one definition is needed.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There is a bug in the powerpc DABR (data access breakpoint) handling,
which can result in us missing breakpoints if several threads are trying
to break on the same address.
The circumstances are that do_page_fault() calls do_dabr(), this clears
the DABR (sets it to 0) and sets up the signal which will report to
userspace that the DABR was hit. The do_signal() code will restore the DABR
value on the way out to userspace.
If we reschedule before calling do_signal(), __switch_to() will check the
cached DABR value and compare it to the new thread's value, if they match
we don't set the DABR in hardware.
So if two threads have the same DABR value, and we schedule from one to
the other after taking the interrupt for the first thread hitting the DABR,
the second thread will run without the DABR set in hardware.
The cleanest fix is to move the cache update into set_dabr(), that way we
can't forget to do it.
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The PT_DTRACE flag is meaningless and obsolete.
Don't touch it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since the PMU is an NMI now, it can come at any time we are only soft
disabled. We must hard disable around the two places we allow the kernel
stack SLB and r1 to go out of sync. Otherwise the PMU exception can
force a kernel stack SLB into another slot, which can lead to it
getting evicted, which can lead to a nasty unrecoverable SLB miss
in the exception entry code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc show_regs prints CPU using smp_processor_id: change that to
raw_smp_processor_id, so that when it's showing a WARN_ON backtrace without
preemption disabled, DEBUG_PREEMPT doesn't mess up that warning with its own.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The commit fa13a5a1f2 (sched: restore
deterministic CPU accounting on powerpc), unconditionally calls
update_process_tick() in system context. In the deterministic
accounting case this is the correct thing to do. However, in the
non-deterministic accounting case we need to not do this, since doing
this results in the time accounted as hardware irq time being
artificially elevated.
Also this collapses 2 consecutive '#ifdef CONFIG_VIRT_CPU_ACCOUNTING'
checks in time.h into one for neatness.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
broken on powerpc, because we end up counting user time twice: once in
timer_interrupt() and once in update_process_times().
This fixes the problem by pulling the code in update_process_times
that updates utime and stime into a separate function called
account_process_tick. If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
there is a version of account_process_tick in kernel/timer.c that
simply accounts a whole tick to either utime or stime as before. If
CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
implement account_process_tick.
This also lets us simplify the s390 code a bit; it means that the s390
timer interrupt can now call update_process_times even when
CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
suitable account_process_tick().
account_process_tick() now takes the task_struct * as an argument.
Tested both with and without CONFIG_VIRT_CPU_ACCOUNTING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
One of the easiest things to isolate is the pid printed in kernel log.
There was a patch, that made this for arch-independent code, this one makes
so for arch/xxx files.
It took some time to cross-compile it, but hopefully these are all the
printks in arch code.
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update dump_task_altivec() (which has so far never been put to use) so that
it dumps the Altivec/VMX registers (VR[0] - VR[31], VSCR and VRSAVE) in the
same format as the ptrace get_vrregs(), and add the appropriate glue
typedef and #defines to make it work.
A new note type of NT_PPC_VMX was chosen to be 0x100 (arbitrarily) because
it allows the low range values to be used for more generic purposes and
0x100 seems an adequate starting point for PowerPC extensions.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the kernel use 1TB segments for all kernel mappings and for
user addresses of 1TB and above, on machines which support them
(currently POWER5+, POWER6 and PA6T).
We detect that the machine supports 1TB segments by looking at the
ibm,processor-segment-sizes property in the device tree.
We don't currently use 1TB segments for user addresses < 1T, since
that would effectively prevent 32-bit processes from using huge pages
unless we also had a way to revert to using 256MB segments. That
would be possible but would involve extra complications (such as
keeping track of which segment size was used when HPTEs were inserted)
and is not addressed here.
Parts of this patch were originally written by Ben Herrenschmidt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
On non-book-E, exceptions execute in real mode. If a fault happens
that leads to a register dump, the kernel currently prints XXXXXXXX
because it doesn't realize that PC is a physical address.
This patch checks whether instruction address translation is turned
on, and if not converts PC into a virtual address.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When PTRACE_O_TRACEEXEC is used, a ptrace call to fetch the registers at
the PTRACE_EVENT_EXEC stop (PTRACE_PEEKUSR) will oops in CHECK_FULL_REGS.
With recent versions, "gdb --args /bin/sh -c 'exec /bin/true'" and "run" at
the (gdb) prompt is sufficient to produce this. I also have written an
isolated test case, see https://bugzilla.redhat.com/show_bug.cgi?id=301791#c15.
This change fixes the problem by clearing the low bit of pt_regs.trap in
start_thread so that FULL_REGS is true again. This is correct since all of
the GPRs that "full" refers to are cleared in start_thread.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Make it so that SPE support can be determined at runtime. This is similiar
to how we handle AltiVec. This allows us to have SPE support built in and
work on processors with and without SPE.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
When we flush register state for FP, Altivec, or SPE in flush_*_to_thread
we need to respect the task_struct that the caller has passed to us.
Most cases we are called with current, however sometimes (ptrace) we may
be passed a different task_struct.
This showed up when using gdbserver debugging a simple program that used
floating point. When gdb tried to show the FP regs they all showed up as
0, because the child's FP registers were never properly flushed to memory.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
In a show_regs() message The DEAR and ESR were reported as
DAR and DSISR which only exist on classic parts.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
This patch removes the #ifdef CONFIG_PPC64 around setting the DABR.
The actual setting of the SPR inside of the set_dabr() function is dependent
on CONFIG_PPC64 || CONFIG_6xx but you can always provide a ppc_md hook to
override that. We should improve support for different HW breakpoints
facilities but this is a first step.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current tlb flush code on powerpc 64 bits has a subtle race since we
lost the page table lock due to the possible faulting in of new PTEs
after a previous one has been removed but before the corresponding hash
entry has been evicted, which can leads to all sort of fatal problems.
This patch reworks the batch code completely. It doesn't use the mmu_gather
stuff anymore. Instead, we use the lazy mmu hooks that were added by the
paravirt code. They have the nice property that the enter/leave lazy mmu
mode pair is always fully contained by the PTE lock for a given range
of PTEs. Thus we can guarantee that all batches are flushed on a given
CPU before it drops that lock.
We also generalize batching for any PTE update that require a flush.
Batching is now enabled on a CPU by arch_enter_lazy_mmu_mode() and
disabled by arch_leave_lazy_mmu_mode(). The code epects that this is
always contained within a PTE lock section so no preemption can happen
and no PTE insertion in that range from another CPU. When batching
is enabled on a CPU, every PTE updates that need a hash flush will
use the batch for that flush.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Use lowercase for hex printouts in oops messages. The number of times I have
tried to copy and paste from an oops into an objdump search...
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Remove last_syscall from 32bit powerpc, its been gone in 64bit for years.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix atomicity of TIF update in flush_thread() for powerpc
Fixes it correctly with *_ti_thread_flag.
Race :
parent process executing :
sys_ptrace()
(lock_kernel())
(ptrace_get_task_struct(pid))
arch_ptrace()
ptrace_detach()
ptrace_disable(child);
clear_singlestep(child);
clear_tsk_thread_flag(child, TIF_SINGLESTEP);
(which clears the TIF_SINGLESTEP flag atomically from a different
process)
(put_task_struct(child))
(unlock_kernel())
And at the same time, in the child process :
sys_execve()
do_execve()
search_binary_handler()
load_elf_binary()
flush_old_exec()
flush_thread()
doing a non-atomic thread flag update
Applies on 2.6.20.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Paul Mackerras <paulus@samba.org>
If something has overflowed or corrupted the stack and causes an oops,
and we try to print a stack trace, that will call validate_sp, which
can itself cause an oops if the cpu field of the thread_info struct at
the bottom of the stack has been corrupted (if CONFIG_IRQSTACKS is
set). This makes debugging harder.
To avoid the second oops, this adds a check to make sure that the cpu
number is reasonable before using it to check whether the stack is on
the softirq or hardirq stack.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Instead of just checking that an address is in the right range, use the
provided __kernel_text_address() helper which covers both the kernel and
module text sections.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In some places, particularly drivers and __init code, the init utsns is the
appropriate one to use. This patch replaces those with a the init_utsname
helper.
Changes: Removed several uses of init_utsname(). Hope I picked all the
right ones in net/ipv4/ipconfig.c. These are now changed to
utsname() (the per-process namespace utsname) in the previous
patch (2/7)
[akpm@osdl.org: CIFS fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This gives the ability to control whether alignment exceptions get
fixed up or reported to the process as a SIGBUS, using the existing
PR_SET_UNALIGN and PR_GET_UNALIGN prctls. We do not implement the
option of logging a message on alignment exceptions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the PowerPC part of the code to allow processes to change
their endian mode via prctl.
This also extends the alignment exception handler to be able to fix up
alignment exceptions that occur in little-endian mode, both for
"PowerPC" little-endian and true little-endian.
We always enter signal handlers in big-endian mode -- the support for
little-endian mode does not amount to the creation of a little-endian
user/kernel ABI. If the signal handler returns, the endian mode is
restored to what it was when the signal was delivered.
We have two new kernel CPU feature bits, one for PPC little-endian and
one for true little-endian. Most of the classic 32-bit processors
support PPC little-endian, and this is reflected in the CPU feature
table. There are two corresponding feature bits reported to userland
in the AT_HWCAP aux vector entry.
This is based on an earlier patch by Anton Blanchard.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The only user of get_wchan is the proc fs - and proc can't be built modular.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Export validate_sp so we can use it in the oprofile calltrace code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
powerpc currently declares some of its own system calls
in <asm/unistd.h>, but not all of them. That place also
contains remainders of the now almost unused kernel syscall
hack.
- Add a new <asm/syscalls.h> with clean declarations
- Include that file from every source that implements one
of these
- Get rid of old declarations in <asm/unistd.h>
This patch is required as a base for implementing system
calls from an SPU, but also makes sense as a general
cleanup.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When kretprobe probes the schedule() function, if the probed process exits
then schedule() will never return, so some kretprobe instances will never
be recycled.
In this patch the parent process will recycle retprobe instances of the
probed function and there will be no memory leak of kretprobe instances.
Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.
To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on
* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.
On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.
This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.
This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The runlatch SPR can take a lot of time to write. My original runlatch
code would set it on every exception entry even though most of the time
this was not required. It would also continually set it in the idle
loop, which is an issue on an SMT capable processor.
Now we cache the runlatch value in a threadinfo bit, and only check for
it in decrementer and hardware interrupt exceptions as well as the idle
loop. Boot on POWER3, POWER5 and iseries, and compile tested on pmac32.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch removes all self references and fixes references to files
in the now defunct arch/ppc64 tree. I think this accomplises
everything wanted, though there might be a few references I missed.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Commit 5388fb1025 made signal_32.c
use discard_lazy_cpu_state, which broke ARCH=ppc because that
uses the common signal_32.c but has its own process.c. Make ARCH=ppc
use the common process.c to fix this and to reduce the amount
of duplicated code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Heikki Lindholm pointed out that there was a potential race with the
lazy CPU state (FP, VR, EVR) stuff if preempt is enabled. The race
is that in the process of restoring FP state on sigreturn, the task
gets preempted by a user task that wants to use the FPU. It will take
an FP unavailable exception, which will write the current FPU state
to the thread_struct, overwriting the values which sigreturn has
stored. Note that this can only happen on UP since we don't implement
lazy CPU state on SMP.
The fix is to flush the lazy CPU state before updating the
thread_struct. To do this we re-use the flush_lazy_cpu_state()
function from process.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes a bug noticed by Paolo Galtieri and fixed for ARCH=ppc in
the previous commit (ppc: fix floating point register corruption).
This fixes the arch/powerpc code by adding preempt_disable/enable,
and also cleans it up a bit by pulling out the code that discards
any lazily-switched CPU register state into a new function, rather
than having that code repeated in three places.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix a bug in kprobes that can cause an Oops or even a crash when a return
probe is installed on one of the following functions: sys_execve,
do_execve, load_*_binary, flush_old_exec, or flush_thread. The fix is to
remove the call to kprobe_flush_task() in flush_thread(). This fix has
been tested on all architectures for which the return-probes feature has
been implemented (i386, x86_64, ppc64, ia64). Please apply.
BACKGROUND
Up to now, we have called kprobe_flush_task() under two situations: when a
task exits, and when it execs. Flushing kretprobe_instances on exit is
correct because (a) do_exit() doesn't return, and (b) one or more
return-probed functions may be active when a task calls do_exit(). Neither
is the case for sys_execve() and its callees.
Initially, the mistaken call to kprobe_flush_task() on exec was harmless
because we put the "real" return address of each active probed function
back in the stack, just to be safe, when we recycled its
kretprobe_instance. When support for ppc64 and ia64 was added, this safety
measure couldn't be employed, and was eventually dropped even for i386 and
x86_64. sys_execve() and its callees were informally blacklisted for
return probes until this fix was developed.
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>