There is a kernel crash that happens if rt_sigreturn() is called inside
a transactional block.
This crash happens if the kernel hits an in-kernel page fault when
accessing userspace memory, usually through copy_ckvsx_to_user(). A
major page fault calls might_sleep() function, which can cause a task
reschedule. A task reschedule (switch_to()) reclaim and recheckpoint
the TM states, but, in the signal return path, the checkpointed memory
was already reclaimed, thus the exception stack has MSR that points to
MSR[TS]=0.
When the code returns from might_sleep() and a task reschedule
happened, then this task is returned with the memory recheckpointed,
and CPU MSR[TS] = suspended.
This means that there is a side effect at might_sleep() if it is
called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0.
This side effect can cause a TM bad thing, since at the exception
entrance, the stack saves MSR[TS]=0, and this is what will be used at
RFID, but, the processor has MSR[TS] = Suspended, and this transition
will be invalid and a TM Bad thing will be raised, causing the
following crash:
Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033
cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70]
pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc
lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0
sp: c0000004263ebc40
msr: 8000000302a03031
current = 0xc000000415050300
paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01
pid = 25006, comm = sigfuz
Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
WARNING: exception is not recoverable, can't continue
enter ? for help
[c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
[c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430
[c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at 00007fffbaac400c
SP (7fffeca90f40) is in userspace
The solution for this problem is running the sigreturn code with
regs->msr[TS] disabled, thus, avoiding hitting the side effect above.
This does not seem to be a problem since regs->msr will be replaced by
the ucontext value, so, it is being flushed already. In this case, it
is flushed earlier.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit e1c3743e1a ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
moved a code block around and this block uses a 'msr' variable outside of
the CONFIG_PPC_TRANSACTIONAL_MEM, however the 'msr' variable is declared
inside a CONFIG_PPC_TRANSACTIONAL_MEM block, causing a possible error when
CONFIG_PPC_TRANSACTION_MEM is not defined.
error: 'msr' undeclared (first use in this function)
This is not causing a compilation error in the mainline kernel, because
'msr' is being used as an argument of MSR_TM_ACTIVE(), which is defined as
the following when CONFIG_PPC_TRANSACTIONAL_MEM is *not* set:
#define MSR_TM_ACTIVE(x) 0
This patch just fixes this issue avoiding the 'msr' variable usage outside
the CONFIG_PPC_TRANSACTIONAL_MEM block, avoiding trusting in the
MSR_TM_ACTIVE() definition.
Cc: stable@vger.kernel.org
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Fixes: e1c3743e1a ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a TM Bad Thing bug that can be caused when you return from a
signal context in a suspended transaction but with ucontext MSR[TS] unset.
This forces regs->msr[TS] to be set at syscall entrance (since the CPU
state is transactional). It also calls treclaim() to flush the transaction
state, which is done based on the live (mfmsr) MSR state.
Since user context MSR[TS] is not set, then restore_tm_sigcontexts() is not
called, thus, not executing recheckpoint, keeping the CPU state as not
transactional. When calling rfid, SRR1 will have MSR[TS] set, but the CPU
state is non transactional, causing the TM Bad Thing with the following
stack:
[ 33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
pc: c00000000000c47c: fast_exception_return+0xac/0xb4
lr: 00003fff865f442c
sp: 3fffd9dce3e0
msr: 8000000102a03031
current = 0xc00000041f68b700
paca = 0xc00000000fb84800 softe: 0 irq_happened: 0x01
pid = 1721, comm = tm-signal-sigre
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
WARNING: exception is not recoverable, can't continue
The same problem happens on 32-bits signal handler, and the fix is very
similar, if tm_recheckpoint() is not executed, then regs->msr[TS] should be
zeroed.
This patch also fixes a sparse warning related to lack of indentation when
CONFIG_PPC_TRANSACTIONAL_MEM is set.
Fixes: 2b0a576d15 ("powerpc: Add new transactional memory state to the signal context")
CC: Stable <stable@vger.kernel.org> # 3.10+
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without TEXASR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is
scheduled back, it will try to reclaim, which will be aborted because of
the CPU is not in the suspended state, and, then, recheckpoint. This
recheckpoint will restore thread->texasr into TEXASR SPR, which might be
zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), thus, this block must
be atomic from a preemption perspective, thus, calling
preempt_disable/enable() on this code.
It is not possible to move tm_recheckpoint to happen earlier, because it is
required to get the checkpointed registers from userspace, with
__get_user(), thus, the only way to avoid this undesired behavior is
delaying the MSR[TS] set.
The 32-bits signal handler seems to be safe this current issue, but, it
might be exposed to the preemption issue, thus, disabling preemption in
this chunk of code.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393a ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With SYSCALL_DEFINEx() disabling -Wattribute-alias generically, there's
no need to duplicate that for PowerPC syscalls.
This reverts commit 4155203739 ("powerpc: fix build failure by
disabling attribute-alias warning in pci_32") and commit 2479bfc9bc
("powerpc: Fix build by disabling attribute-alias warning for
SYSCALL_DEFINEx").
Signed-off-by: Paul Burton <paul.burton@mips.com>
Acked-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
GCC 8.1 emits warnings such as the following. As arch/powerpc code is
built with -Werror, this breaks the build with GCC 8.1.
In file included from arch/powerpc/kernel/pci_64.c:23:
./include/linux/syscalls.h:233:18: error: 'sys_pciconfig_iobase' alias
between functions of incompatible types 'long int(long int, long
unsigned int, long unsigned int)' and 'long int(long int, long int,
long int)' [-Werror=attribute-alias]
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
^~~
./include/linux/syscalls.h:222:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
This patch inhibits those warnings.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Trim change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mpe: Fix sys_debug_setcontext() prototype to return long]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently it's possible that a thread on PPC64 LE has its endianness
flipped inadvertently to Big-Endian resulting in a crash once the process
is back from the signal handler.
If giveup_all() is called when regs->msr has the bits MSR.FP and MSR.VEC
disabled (and hence MSR.VSX disabled too) it returns without calling
check_if_tm_restore_required() which copies regs->msr to ckpt_regs->msr if
the process caught a signal whilst in transactional mode. Then once in
setup_tm_sigcontexts() MSR from ckpt_regs.msr is used, but since
check_if_tm_restore_required() was not called previuosly, gp_regs[PT_MSR]
gets a copy of invalid MSR bits as MSR in ckpt_regs was not updated from
regs->msr and so is zeroed. Later when leaving the signal handler once in
sys_rt_sigreturn() the TS bits of gp_regs[PT_MSR] are checked to determine
if restore_tm_sigcontexts() must be called to pull in the correct MSR state
into the user context. Because TS bits are zeroed
restore_tm_sigcontexts() is never called and MSR restored from the user
context on returning from the signal handler has the MSR.LE (the endianness
bit) forced to zero (Big-Endian). That leads, for instance, to 'nop' being
treated as an illegal instruction in the following sequence:
tbegin.
beq 1f
trap
tend.
1: nop
on PPC64 LE machines and the process dies just after returning from the
signal handler.
PPC64 BE is also affected but in a subtle way since forcing Big-Endian on
a BE machine does not change the endianness.
This commit fixes the issue described above by ensuring that once in
setup_tm_sigcontexts() the MSR used is from regs->msr instead of from
ckpt_regs->msr and by ensuring that we pull in only the MSR.FP, MSR.VEC,
and MSR.VSX bits from ckpt_regs->msr.
The fix was tested both on LE and BE machines and no regression regarding
the powerpc/tm selftests was observed.
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have always had softe in pt_regs, and accessible via PT_SOFTE, even
though it is not userspace state.
The value userspace sees should always be 1, because we should never
be in userspace with interrupts soft disabled.
In a subsequent patch we will be changing the semantics of the kernel
softe value, so hard wire the value to 1 to retain the existing
semantics. As far as we know nothing ever looks at it, but better safe
than sorry.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Split out of larger patch, write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Non-highlights:
- Five fixes for the >128T address space handling, both to fix bugs in our
implementation and to bring the semantics exactly into line with x86.
Highlights:
- Support for a new OPAL call on bare metal machines which gives us a true NMI
(ie. is not masked by MSR[EE]=0) for debugging etc.
- Support for Power9 DD2 in the CXL driver.
- Improvements to machine check handling so that uncorrectable errors can be
reported into the generic memory_failure() machinery.
- Some fixes and improvements for VPHN, which is used under PowerVM to notify
the Linux partition of topology changes.
- Plumbing to enable TM (transactional memory) without suspend on some Power9
processors (PPC_FEATURE2_HTM_NO_SUSPEND).
- Support for emulating vector loads form cache-inhibited memory, on some
Power9 revisions.
- Disable the fast-endian switch "syscall" by default (behind a CONFIG), we
believe it has never had any users.
- A major rework of the API drivers use when initiating and waiting for long
running operations performed by OPAL firmware, and changes to the
powernv_flash driver to use the new API.
- Several fixes for the handling of FP/VMX/VSX while processes are using
transactional memory.
- Optimisations of TLB range flushes when using the radix MMU on Power9.
- Improvements to the VAS facility used to access coprocessors on Power9, and
related improvements to the way the NX crypto driver handles requests.
- Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
Thanks to:
Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh
Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao,
Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R.
Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren
Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami
Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de
Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen
Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, William A. Kennington III.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaDXGuAAoJEFHr6jzI4aWAEqwP/0TA35KFAK6wqfkCf67z4q+O
I+5piI4eDV4jdCakfoIN1JfjhQRULNePSoCHTccan30mu/bm30p69xtOLL2/h5xH
Mhz/eDBAOo0lrT20nyZfYMW3FnM66wnNf++qJ0O+8L052r4WOB02J0k1uM1ST01D
5Lb5mUoxRLRzCgKRYAYWJifn+IFPUB9NMsvMTym94krAFlIjIzMEQXhDoln+jJMr
QmY5f1BTA/fLfXobn0zwoc/C1oa2PUtxd+rxbwGrLoZ6G843mMqUi90SMr5ybhXp
RzepnBTj4by3vOsnk/X1mANyaZfLsunp75FwnjHdPzKrAS/TuPp8D/iSxxE/PzEq
cLwJFBnFXSgQMefDErXxhHSDz2dAg5r14rsTpDcq2Ko8TPV4rPsuSfmbd9Txekb0
yWHsjoJUBBMl2QcWqIHl+AlV8j1RklF6solcTBcGnH1CZJMfa05VKXV7xGEvOHa0
RJ+/xPyR9KjoB/SUp++9Vmx/M6SwQYFOJlr3Zpg9LNtR8WpoPYu1E6eO+u1Hhzny
eJqaNstH+i+VdY9eqszkAsEBh8o9M/+b+7Wx7TetvU+v368CbXtgFYs9qy2oZjPF
t9sY/BHaHZ8eZ7I00an77a0fVV5B1PVASUtIz5CqkwGpMvX6Z6W2K/XUUFI61kuu
E06HS6Ht8UPJAzrAPUMl
=Rq81
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"A bit of a small release, I suspect in part due to me travelling for
KS. But my backlog of patches to review is smaller than usual, so I
think in part folks just didn't send as much this cycle.
Non-highlights:
- Five fixes for the >128T address space handling, both to fix bugs
in our implementation and to bring the semantics exactly into line
with x86.
Highlights:
- Support for a new OPAL call on bare metal machines which gives us a
true NMI (ie. is not masked by MSR[EE]=0) for debugging etc.
- Support for Power9 DD2 in the CXL driver.
- Improvements to machine check handling so that uncorrectable errors
can be reported into the generic memory_failure() machinery.
- Some fixes and improvements for VPHN, which is used under PowerVM
to notify the Linux partition of topology changes.
- Plumbing to enable TM (transactional memory) without suspend on
some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND).
- Support for emulating vector loads form cache-inhibited memory, on
some Power9 revisions.
- Disable the fast-endian switch "syscall" by default (behind a
CONFIG), we believe it has never had any users.
- A major rework of the API drivers use when initiating and waiting
for long running operations performed by OPAL firmware, and changes
to the powernv_flash driver to use the new API.
- Several fixes for the handling of FP/VMX/VSX while processes are
using transactional memory.
- Optimisations of TLB range flushes when using the radix MMU on
Power9.
- Improvements to the VAS facility used to access coprocessors on
Power9, and related improvements to the way the NX crypto driver
handles requests.
- Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew
Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin
Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard,
Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven,
Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley,
Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu,
Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia
Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee,
Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel
Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A.
Kennington III"
* tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits)
powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature
powerpc/64s: Fix masking of SRR1 bits on instruction fault
powerpc/64s: mm_context.addr_limit is only used on hash
powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
powerpc/64s/hash: Fix fork() with 512TB process address space
powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
powerpc/64s/hash: Fix 512T hint detection to use >= 128T
powerpc: Fix DABR match on hash based systems
powerpc/signal: Properly handle return value from uprobe_deny_signal()
powerpc/fadump: use kstrtoint to handle sysfs store
powerpc/lib: Implement UACCESS_FLUSHCACHE API
powerpc/lib: Implement PMEM API
powerpc/powernv/npu: Don't explicitly flush nmmu tlb
powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()
powerpc/powernv/idle: Round up latency and residency values
powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
powerpc/kprobes: Blacklist emulate_update_regs() from kprobes
powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace
powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
...
Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
tm_reclaim() has optimisations to not always save the FP/Altivec
registers to the checkpointed save area. This was originally done
because the caller might have information that the checkpointed
registers aren't valid due to lazy save and restore. We've also been a
little vague as to how tm_reclaim() leaves the FP/Altivec state since it
doesn't necessarily always save it to the thread struct. This has lead
to an (incorrect) assumption that it leaves the checkpointed state on
the CPU.
tm_recheckpoint() has similar optimisations in reverse. It may not
always reload the checkpointed FP/Altivec registers from the thread
struct before the trecheckpoint. It is therefore quite unclear where it
expects to get the state from. This didn't help with the assumption
made about tm_reclaim().
These optimisations sit in what is by definition a slow path. If a
process has to go through a reclaim/recheckpoint then its transaction
will be doomed on returning to userspace. This mean that the process
will be unable to complete its transaction and be forced to its failure
handler. This is already an out if line case for userspace. Furthermore,
the cost of copying 64 times 128 bits from registers isn't very long[0]
(at all) on modern processors. As such it appears these optimisations
have only served to increase code complexity and are unlikely to have
had a measurable performance impact.
Our transactional memory handling has been riddled with bugs. A cause
of this has been difficulty in following the code flow, code complexity
has not been our friend here. It makes sense to remove these
optimisations in favour of a (hopefully) more stable implementation.
This patch does mean that some times the assembly will needlessly save
'junk' registers which will subsequently get overwritten with the
correct value by the C code which calls the assembly function. This
small inefficiency is far outweighed by the reduction in complexity for
general TM code, context switching paths, and transactional facility
unavailable exception handler.
0: I tried to measure it once for other work and found that it was
hiding in the noise of everything else I was working with. I find it
exceedingly likely this will be the case here.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unfortunately userspace can construct a sigcontext which enables
suspend. Thus userspace can force Linux into a path where trechkpt is
executed.
This patch blocks this from happening on POWER9 by sanity checking
sigcontexts passed in.
ptrace doesn't have this problem as only MSR SE and BE can be changed
via ptrace.
This patch also adds a number of WARN_ON()s in case we ever enter
suspend when we shouldn't. This should not happen, but if it does the
symptoms are soft lockup warnings which are not obviously TM related,
so the WARN_ON()s should make it obvious what's happening.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently it's possible that on returning from the signal handler
through the restore_tm_sigcontexts() code path (e.g. from a signal
caught due to a `trap` instruction executed in the middle of an HTM
block, or a deliberately constructed sigframe) an illegal TM state
(like TS=10 TM=0, i.e. "T0") is set in SRR1 and when `rfid` sets
implicitly the MSR register from SRR1 register on return to userspace
it causes a TM Bad Thing exception.
That illegal state can be set (a) by a malicious user that disables
the TM bit by tweaking the bits in uc_mcontext before returning from
the signal handler or (b) by a sufficient number of context switches
occurring such that the load_tm counter overflows and TM is disabled
whilst in the signal handler.
This commit fixes the illegal TM state by ensuring that TM bit is
always enabled before we return from restore_tm_sigcontexts(). A small
comment correction is made as well.
Fixes: 5d176f751e ("powerpc: tm: Enable transactional memory (TM) lazily for userspace")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the structures being used for checkpointed state named
consistently with the pt_regs/ckpt_regs.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is currently an inconsistency as to how the entire CPU register
state is saved and restored when a thread uses transactional memory
(TM).
Using transactional memory results in the CPU having duplicated
(almost) all of its register state. This duplication results in a set
of registers which can be considered 'live', those being currently
modified by the instructions being executed and another set that is
frozen at a point in time.
On context switch, both sets of state have to be saved and (later)
restored. These two states are often called a variety of different
things. Common terms for the state which only exists after the CPU has
entered a transaction (performed a TBEGIN instruction) in hardware are
'transactional' or 'speculative'.
Between a TBEGIN and a TEND or TABORT (or an event that causes the
hardware to abort), regardless of the use of TSUSPEND the
transactional state can be referred to as the live state.
The second state is often to referred to as the 'checkpointed' state
and is a duplication of the live state when the TBEGIN instruction is
executed. This state is kept in the hardware and will be rolled back
to on transaction failure.
Currently all the registers stored in pt_regs are ALWAYS the live
registers, that is, when a thread has transactional registers their
values are stored in pt_regs and the checkpointed state is in
ckpt_regs. A strange opposite is true for fp_state/vr_state. When a
thread is non transactional fp_state/vr_state holds the live
registers. When a thread has initiated a transaction fp_state/vr_state
holds the checkpointed state and transact_fp/transact_vr become the
structure which holds the live state (at this point it is a
transactional state).
This method creates confusion as to where the live state is, in some
circumstances it requires extra work to determine where to put the
live state and prevents the use of common functions designed (probably
before TM) to save the live state.
With this patch pt_regs, fp_state and vr_state all represent the
same thing and the other structures [pending rename] are for
checkpointed state.
Acked-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Much of the signal code takes a pt_regs on which it operates. Over
time the signal code has needed to know more about the thread than
what pt_regs can supply, this information is obtained as needed by
using 'current'.
This approach is not strictly incorrect however it does mean that
there is now a hard requirement that the pt_regs being passed around
does belong to current, this is never checked. A safer approach is for
the majority of the signal functions to take a task_struct from which
they can obtain pt_regs and any other information they need. The
caveat that the task_struct they are passed must be current doesn't go
away but can more easily be checked for.
Functions called from outside powerpc signal code are passed a pt_regs
and they can confirm that the pt_regs is that of current and pass
current to other functions, furthurmore, powerpc signal functions can
check that the task_struct they are passed is the same as current
avoiding possible corruption of current (or the task they are passed)
if this assertion ever fails.
CC: paulus@samba.org
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Normally, when MSR[VSX/VR/SPE] bits == 1, the used_vsr/used_vr/used_spe
bit have already been set. However when loading a signal frame from user
space we need to explicitly set used_vsr/used_vr/used_spe to make them
consistent with the MSR bits from the signal frame.
For example, CRIU application, who utilizes sigreturn to restore
checkpointed process, will lead to the case where MSR[VSX] bit is active
in signal frame, but used_vsr bit is not set in the kernel. (the same
applies to VR/SPE).
This patch fixes this by always setting used_* bit when MSR related bits
are active in signal frame and we are doing sigreturn.
Based on a proposal by Benh.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
[mpe: Massage change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Another set of things that are only called from assembler and so need
prototypes to keep sparse happy.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Userspace can begin and suspend a transaction within the signal
handler which means they might enter sys_rt_sigreturn() with the
processor in suspended state.
sys_rt_sigreturn() wants to restore process context (which may have
been in a transaction before signal delivery). To do this it must
restore TM SPRS. To achieve this, any transaction initiated within the
signal frame must be discarded in order to be able to restore TM SPRs
as TM SPRs can only be manipulated non-transactionally..
>From the PowerPC ISA:
TM Bad Thing Exception [Category: Transactional Memory]
An attempt is made to execute a mtspr targeting a TM register in
other than Non-transactional state.
Not doing so results in a TM Bad Thing:
[12045.221359] Kernel BUG at c000000000050a40 [verbose debug info unavailable]
[12045.221470] Unexpected TM Bad Thing exception at c000000000050a40 (msr 0x201033)
[12045.221540] Oops: Unrecoverable exception, sig: 6 [#1]
[12045.221586] SMP NR_CPUS=2048 NUMA PowerNV
[12045.221634] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_hv kvm
uio_pdrv_genirq ipmi_powernv uio powernv_rng ipmi_msghandler autofs4 ses enclosure
scsi_transport_sas bnx2x ipr mdio libcrc32c
[12045.222167] CPU: 68 PID: 6178 Comm: sigreturnpanic Not tainted 4.7.0 #34
[12045.222224] task: c0000000fce38600 ti: c0000000fceb4000 task.ti: c0000000fceb4000
[12045.222293] NIP: c000000000050a40 LR: c0000000000163bc CTR: 0000000000000000
[12045.222361] REGS: c0000000fceb7ac0 TRAP: 0700 Not tainted (4.7.0)
[12045.222418] MSR: 9000000300201033 <SF,HV,ME,IR,DR,RI,LE,TM[SE]> CR: 28444280 XER: 20000000
[12045.222625] CFAR: c0000000000163b8 SOFTE: 0 PACATMSCRATCH: 900000014280f033
GPR00: 01100000b8000001 c0000000fceb7d40 c00000000139c100 c0000000fce390d0
GPR04: 900000034280f033 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000000000 b000000000001033 0000000000000001 0000000000000000
GPR12: 0000000000000000 c000000002926400 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 00003ffff98cadd0 00003ffff98cb470 0000000000000000
GPR28: 900000034280f033 c0000000fceb7ea0 0000000000000001 c0000000fce390d0
[12045.223535] NIP [c000000000050a40] tm_restore_sprs+0xc/0x1c
[12045.223584] LR [c0000000000163bc] tm_recheckpoint+0x5c/0xa0
[12045.223630] Call Trace:
[12045.223655] [c0000000fceb7d80] [c000000000026e74] sys_rt_sigreturn+0x494/0x6c0
[12045.223738] [c0000000fceb7e30] [c0000000000092e0] system_call+0x38/0x108
[12045.223806] Instruction dump:
[12045.223841] 7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
[12045.223955] 4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
[12045.224074] ---[ end trace cb8002ee240bae76 ]---
It isn't clear exactly if there is really a use case for userspace
returning with a suspended transaction, however, doing so doesn't (on
its own) constitute a bad frame. As such, this patch simply discards
the transactional state of the context calling the sigreturn and
continues.
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Acked-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In setup_sigcontext(), we set current->thread.vrsave then use it
straight after. Since current is hidden from the compiler via inline
assembly, it cannot optimise this and we end up with a load hit store.
Fix this by using a temporary.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- tm: Block signal return from setting invalid MSR state from Michael Neuling
- tm: Check for already reclaimed tasks from Michael Neuling
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWVul+AAoJEFHr6jzI4aWAUZAQAK2s+E4WTtcExXSG1bqq05o6
5wCLtIq6M91h5HDpgBfN7S5OXpRd72ZeIVdzC0HkFeLBF3y7NSHSEezUw4g/GGfz
K2xGV1CCXC3Rb3qyHSdyi6+c1AnLVRPBVzVxPVmlXigrXeFiQ4613YW9rzf9b8fs
oktUciwW9aHbrIv7g8f82gpuk9jwwhp/sF+1H/7fGOozT4CFsKo4wj4HOOCBwH4y
ODEjs6Z+9Uwb6Kfvi/rn3k4XA1wC36WFq3ORI6KrmK/ZB1eR0Kwf0IELYpMj8cOX
q5ZtCH7t68f9vmEK2B34AUijf/amm+2vLwvF6xAuZJFPUPZtgMBdRcqkLalbtPAO
8hlyPPgoZcgR/Of+lEYxUobcL0SMNufXwmfwRO35ktkm9Z9Ee96C8NNbpybBSDXL
YRa6is5MeO4GL8Gbcc0TA50hGjok7o3acGE6HSAReyzf0guQ4xqcif7+6lfWZPkk
P3aM02ajp2qoqyjhT/Ei6JlMptAiuQY+HvELFneqn5s9nDbv6cGuYZNNap0c1fK+
74W0p7MiZh7+IF5HpyUIeYV836inXMDIoKzjA6H3OWitk/1lbcrbF34Qpz9zAWZn
YF3w786ZzzQLw0jcALaqZejm58MLGIakO4MNDB0/ZBh0nKKfEV8WvPOjev78OAp1
+pDrJh0iQzrujN6OMKQ0
=IL8A
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.4-3' into next
Merge the two TM fixes we merged in 4.4. We are about to merge selftests
for these, and without the fixes the selftests will oops.
powerpc fixes for 4.4 #2
- tm: Block signal return from setting invalid MSR state from Michael Neuling
- tm: Check for already reclaimed tasks from Michael Neuling
Move the MSR modification into c. Removing it from the assembly
function will allow us to avoid costly MSR writes by batching them
up.
Check the FP and VMX bits before calling the relevant giveup_*()
function. This makes giveup_vsx() and flush_vsx_to_thread() perform
more like their sister functions, and allows us to use
flush_vsx_to_thread() in the signal code.
Move the check_if_tm_restore_required() check in.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The UP only lazy floating point and vector optimisations were written
back when SMP was not common, and neither glibc nor gcc used vector
instructions. Now SMP is very common, glibc aggressively uses vector
instructions and gcc autovectorises.
We want to add new optimisations that apply to both UP and SMP, but
in preparation for that remove these UP only optimisations.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we allow both the MSR T and S bits to be set by userspace on
a signal return. Unfortunately this is a reserved configuration and
will cause a TM Bad Thing exception if attempted (via rfid).
This patch checks for this case in both the 32 and 64 bit signals
code. If both T and S are set, we mark the context as invalid.
Found using a syscall fuzzer.
Fixes: 2b0a576d15 ("powerpc: Add new transactional memory state to the signal context")
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds one helper function 'sigcontext_vmx_regs' which computes
quad word aligned pointer for 'vmx_reserve' array element in sigcontext
structure making the code more readable.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Reword comment and fix build for CONFIG_ALTIVEC=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If an attacker can cause a controlled kernel stack overflow, overwriting
the restart block is a very juicy exploit target. This is because the
restart_block is held in the same memory allocation as the kernel stack.
Moving the restart block to struct task_struct prevents this exploit by
making the restart_block harder to locate.
Note that there are other fields in thread_info that are also easy
targets, at least on some architectures.
It's also a decent simplification, since the restart code is more or less
identical on all architectures.
[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the more generic functions get_signal() signal_setup_done()
for signal delivery.
This inverts also the return codes of setup_*frame() to follow the
kernel convention.
Signed-off-by: Richard Weinberger <richard@nod.at>
We have some compile-time disabled debug code in signal_xx.c. It's from
some ancient time BG, almost certainly part of the original port, given
the very similar code on other arches.
The show_unhandled_signal logic, added in d0c3d534a4 (2.6.24) is
cleaner and prints more useful information, so drop the debug code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't take an IRQ when we're about to do a trechkpt as our GPR state is set
to user GPR values.
We've hit this when running some IBM Java stress tests in the lab resulting in
the following dump:
cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
pc: c000000000050074: restore_gprs+0xc0/0x148
lr: 00000000b52a8184
sp: ac57d360
msr: 8000000100201030
current = 0xc00000002c500000
paca = 0xc000000007dbfc00 softe: 0 irq_happened: 0x00
pid = 34535, comm = Pooled Thread #
R00 = 00000000b52a8184 R16 = 00000000b3e48fda
R01 = 00000000ac57d360 R17 = 00000000ade79bd8
R02 = 00000000ac586930 R18 = 000000000fac9bcc
R03 = 00000000ade60000 R19 = 00000000ac57f930
R04 = 00000000f6624918 R20 = 00000000ade79be8
R05 = 00000000f663f238 R21 = 00000000ac218a54
R06 = 0000000000000002 R22 = 000000000f956280
R07 = 0000000000000008 R23 = 000000000000007e
R08 = 000000000000000a R24 = 000000000000000c
R09 = 00000000b6e69160 R25 = 00000000b424cf00
R10 = 0000000000000181 R26 = 00000000f66256d4
R11 = 000000000f365ec0 R27 = 00000000b6fdcdd0
R12 = 00000000f66400f0 R28 = 0000000000000001
R13 = 00000000ada71900 R29 = 00000000ade5a300
R14 = 00000000ac2185a8 R30 = 00000000f663f238
R15 = 0000000000000004 R31 = 00000000f6624918
pc = c000000000050074 restore_gprs+0xc0/0x148
cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
lr = 00000000b52a8184
msr = 8000000100201030 cr = 24804888
ctr = 0000000000000000 xer = 0000000000000000 trap = 700
This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into
that function. It then adds IRQ disabling over the trechkpt critical section.
It also sets the TEXASR FS in the signals code to ensure this is never set now
that we explictly write the TM sprs in tm_recheckpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The new ELFv2 little-endian ABI increases the stack redzone -- the
area below the stack pointer that can be used for storing data --
from 288 bytes to 512 bytes. This means that we need to allow more
space on the user stack when delivering a signal to a 64-bit process.
To make the code a bit clearer, we define new USER_REDZONE_SIZE and
KERNEL_REDZONE_SIZE symbols in ptrace.h. For now, we leave the
kernel redzone size at 288 bytes, since increasing it to 512 bytes
would increase the size of interrupt stack frames correspondingly.
Gcc currently only makes use of 288 bytes of redzone even when
compiling for the new little-endian ABI, and the kernel cannot
currently be compiled with the new ABI anyway.
In the future, hopefully gcc will provide an option to control the
amount of redzone used, and then we could reduce it even more.
This also changes the code in arch_compat_alloc_user_space() to
preserve the expanded redzone. It is not clear why this function would
ever be used on a 64-bit process, though.
Signed-off-by: Paul Mackerras <paulus@samba.org>
CC: <stable@vger.kernel.org> [v3.13]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, when we have a process using the transactional memory
facilities on POWER8 (that is, the processor is in transactional
or suspended state), and the process enters the kernel and the
kernel then uses the floating-point or vector (VMX/Altivec) facility,
we end up corrupting the user-visible FP/VMX/VSX state. This
happens, for example, if a page fault causes a copy-on-write
operation, because the copy_page function will use VMX to do the
copy on POWER8. The test program below demonstrates the bug.
The bug happens because when FP/VMX state for a transactional process
is stored in the thread_struct, we store the checkpointed state in
.fp_state/.vr_state and the transactional (current) state in
.transact_fp/.transact_vr. However, when the kernel wants to use
FP/VMX, it calls enable_kernel_fp() or enable_kernel_altivec(),
which saves the current state in .fp_state/.vr_state. Furthermore,
when we return to the user process we return with FP/VMX/VSX
disabled. The next time the process uses FP/VMX/VSX, we don't know
which set of state (the current register values, .fp_state/.vr_state,
or .transact_fp/.transact_vr) we should be using, since we have no
way to tell if we are still in the same transaction, and if not,
whether the previous transaction succeeded or failed.
Thus it is necessary to strictly adhere to the rule that if FP has
been enabled at any point in a transaction, we must keep FP enabled
for the user process with the current transactional state in the
FP registers, until we detect that it is no longer in a transaction.
Similarly for VMX; once enabled it must stay enabled until the
process is no longer transactional.
In order to keep this rule, we add a new thread_info flag which we
test when returning from the kernel to userspace, called TIF_RESTORE_TM.
This flag indicates that there is FP/VMX/VSX state to be restored
before entering userspace, and when it is set the .tm_orig_msr field
in the thread_struct indicates what state needs to be restored.
The restoration is done by restore_tm_state(). The TIF_RESTORE_TM
bit is set by new giveup_fpu/altivec_maybe_transactional helpers,
which are called from enable_kernel_fp/altivec, giveup_vsx, and
flush_fp/altivec_to_thread instead of giveup_fpu/altivec.
The other thing to be done is to get the transactional FP/VMX/VSX
state from .fp_state/.vr_state when doing reclaim, if that state
has been saved there by giveup_fpu/altivec_maybe_transactional.
Having done this, we set the FP/VMX bit in the thread's MSR after
reclaim to indicate that that part of the state is now valid
(having been reclaimed from the processor's checkpointed state).
Finally, in the signal handling code, we move the clearing of the
transactional state bits in the thread's MSR a bit earlier, before
calling flush_fp_to_thread(), so that we don't unnecessarily set
the TIF_RESTORE_TM bit.
This is the test program:
/* Michael Neuling 4/12/2013
*
* See if the altivec state is leaked out of an aborted transaction due to
* kernel vmx copy loops.
*
* gcc -m64 htm_vmxcopy.c -o htm_vmxcopy
*
*/
/* We don't use all of these, but for reference: */
int main(int argc, char *argv[])
{
long double vecin = 1.3;
long double vecout;
unsigned long pgsize = getpagesize();
int i;
int fd;
int size = pgsize*16;
char tmpfile[] = "/tmp/page_faultXXXXXX";
char buf[pgsize];
char *a;
uint64_t aborted = 0;
fd = mkstemp(tmpfile);
assert(fd >= 0);
memset(buf, 0, pgsize);
for (i = 0; i < size; i += pgsize)
assert(write(fd, buf, pgsize) == pgsize);
unlink(tmpfile);
a = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
assert(a != MAP_FAILED);
asm __volatile__(
"lxvd2x 40,0,%[vecinptr] ; " // set 40 to initial value
TBEGIN
"beq 3f ;"
TSUSPEND
"xxlxor 40,40,40 ; " // set 40 to 0
"std 5, 0(%[map]) ;" // cause kernel vmx copy page
TABORT
TRESUME
TEND
"li %[res], 0 ;"
"b 5f ;"
"3: ;" // Abort handler
"li %[res], 1 ;"
"5: ;"
"stxvd2x 40,0,%[vecoutptr] ; "
: [res]"=r"(aborted)
: [vecinptr]"r"(&vecin),
[vecoutptr]"r"(&vecout),
[map]"r"(a)
: "memory", "r0", "r3", "r4", "r5", "r6", "r7");
if (aborted && (vecin != vecout)){
printf("FAILED: vector state leaked on abort %f != %f\n",
(double)vecin, (double)vecout);
exit(1);
}
munmap(a, size);
close(fd);
printf("PASSED!\n");
return 0;
}
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In a recent patch:
commit c13f20ac48
Author: Michael Neuling <mikey@neuling.org>
powerpc/signals: Mark VSX not saved with small contexts
We fixed an issue but an improved solution was later discussed after the patch
was merged.
Firstly, this patch doesn't handle the 64bit signals case, which could also hit
this issue (but has never been reported).
Secondly, the original patch isn't clear what MSR VSX should be set to. The
new approach below always clears the MSR VSX bit (to indicate no VSX is in the
context) and sets it only in the specific case where VSX is available (ie. when
VSX has been used and the signal context passed has space to provide the
state).
This reverts the original patch and replaces it with the improved solution. It
also adds a 64 bit version.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For the ELFv2 ABI, the hander is the entry point, not a function descriptor.
We also need to set up r12, and fortunately the fast_exception_return
exit path restores r12 for us so nothing else is required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Topic branch for commits that the KVM tree might want to pull
in separately.
Hand merged a few files due to conflicts with the LE stuff
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct. In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well. Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.
This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc. This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros. This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We always take signals in big endian which is wrong. Signals
should be taken in native endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit c6e6771b(powerpc: Introduce VSX thread_struct and CONFIG_VSX)
we add a invocation of flush_fp_to_thread() before copying the FPR or
VSR to users. But we already invoke the flush_fp_to_thread() in this
function. So remove one of them.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since 2002, the kernel has not saved VRSAVE on exception entry and
restored it on exit; rather, VRSAVE gets context-switched in _switch.
This means that when executing in process context in the kernel, the
userspace VRSAVE value is live in the VRSAVE register.
However, the signal code assumes that current->thread.vrsave holds
the current VRSAVE value, which is incorrect. Therefore, this
commit changes it to use the actual VRSAVE register instead. (It
still uses current->thread.vrsave as a temporary location to store
it in, as __get_user and __put_user can only transfer to/from a
variable, not an SPR.)
This also modifies the transactional memory code to save and restore
VRSAVE regardless of whether VMX is enabled in the MSR. This is
because accesses to VRSAVE are not controlled by the MSR.VEC bit,
but can happen at any time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Address some of the trivial sparse warnings in arch/powerpc.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active. Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.
This changes the signal return code to always restore both contexts on 64 bit
signal return. It also ensures that the MSR TM bits are properly restored from
the signal context which they are not currently.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When in an active transaction that takes a signal, we need to be careful with
the stack. It's possible that the stack has moved back up after the tbegin.
The obvious case here is when the tbegin is called inside a function that
returns before a tend. In this case, the stack is part of the checkpointed
transactional memory state. If we write over this non transactionally or in
suspend, we are in trouble because if we get a tm abort, the program counter
and stack pointer will be back at the tbegin but our in memory stack won't be
valid anymore.
To avoid this, when taking a signal in an active transaction, we need to use
the stack pointer from the checkpointed state, rather than the speculated
state. This ensures that the signal context (written tm suspended) will be
written below the stack required for the rollback. The transaction is aborted
becuase of the treclaim, so any memory written between the tbegin and the
signal will be rolled back anyway.
For signals taken in non-TM or suspended mode, we use the
normal/non-checkpointed stack pointer.
Tested with 64 and 32 bit signals
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't compile a kernel with CONFIG_ALTIVEC=n when
CONFIG_PPC_TRANSACTIONAL_MEM=y. We currently get:
arch/powerpc/kernel/tm.S:320: Error: unsupported relocation against THREAD_VSCR
arch/powerpc/kernel/tm.S:323: Error: unsupported relocation against THREAD_VR0
arch/powerpc/kernel/tm.S:323: Error: unsupported relocation against THREAD_VR0
etc.
The below fixes this with a sprinkling of #ifdefs.
This was found by mpe with kisskb:
http://kisskb.ellerman.id.au/kisskb/buildresult/8539442/
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Pull signal handling cleanups from Al Viro:
"This is the first pile; another one will come a bit later and will
contain SYSCALL_DEFINE-related patches.
- a bunch of signal-related syscalls (both native and compat)
unified.
- a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
(fixing several potential problems with missing argument
validation, while we are at it)
- a lot of now-pointless wrappers killed
- a couple of architectures (cris and hexagon) forgot to save
altstack settings into sigframe, even though they used the
(uninitialized) values in sigreturn; fixed.
- microblaze fixes for delivery of multiple signals arriving at once
- saner set of helpers for signal delivery introduced, several
architectures switched to using those."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
x86: convert to ksignal
sparc: convert to ksignal
arm: switch to struct ksignal * passing
alpha: pass k_sigaction and siginfo_t using ksignal pointer
burying unused conditionals
make do_sigaltstack() static
arm64: switch to generic old sigaction() (compat-only)
arm64: switch to generic compat rt_sigaction()
arm64: switch compat to generic old sigsuspend
arm64: switch to generic compat rt_sigqueueinfo()
arm64: switch to generic compat rt_sigpending()
arm64: switch to generic compat rt_sigprocmask()
arm64: switch to generic sigaltstack
sparc: switch to generic old sigsuspend
sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
sparc: kill sign-extending wrappers for native syscalls
kill sparc32_open()
sparc: switch to use of generic old sigaction
sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
mips: switch to generic sys_fork() and sys_clone()
...
This adds the new transactional memory archtected state to the signal context
in both 32 and 64 bit.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
"Whether" is misspelled in various comments across the tree; this
fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>