The ability for modified CS and/or SS to be useful has nothing
to do with TIF_IA32. Similarly, if there's an exploit involving
changing CS or SS, it's exploitable with or without a TIF_IA32
check.
So just delete the check.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Link: http://lkml.kernel.org/r/71c7ab36456855d11ae07edd4945a7dfe80f9915.1424822291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
TIF_NOHZ is 19 (i.e. _TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME |
_TIF_SINGLESTEP), not (1<<19).
This code is involved in Dave's trinity lockup, but I don't see why
it would cause any of the problems he's seeing, except inadvertently
by causing a different path through entry_64.S's syscall handling.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/a6cd3b60a3f53afb6e1c8081b0ec30ff19003dd7.1416434075.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull audit updates from Eric Paris:
"So this change across a whole bunch of arches really solves one basic
problem. We want to audit when seccomp is killing a process. seccomp
hooks in before the audit syscall entry code. audit_syscall_entry
took as an argument the arch of the given syscall. Since the arch is
part of what makes a syscall number meaningful it's an important part
of the record, but it isn't available when seccomp shoots the
syscall...
For most arch's we have a better way to get the arch (syscall_get_arch)
So the solution was two fold: Implement syscall_get_arch() everywhere
there is audit which didn't have it. Use syscall_get_arch() in the
seccomp audit code. Having syscall_get_arch() everywhere meant it was
a useless flag on the stack and we could get rid of it for the typical
syscall entry.
The other changes inside the audit system aren't grand, fixed some
records that had invalid spaces. Better locking around the task comm
field. Removing some dead functions and structs. Make some things
static. Really minor stuff"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: rename audit_log_remove_rule to disambiguate for trees
audit: cull redundancy in audit_rule_change
audit: WARN if audit_rule_change called illegally
audit: put rule existence check in canonical order
next: openrisc: Fix build
audit: get comm using lock to avoid race in string printing
audit: remove open_arg() function that is never used
audit: correct AUDIT_GET_FEATURE return message type
audit: set nlmsg_len for multicast messages.
audit: use union for audit_field values since they are mutually exclusive
audit: invalid op= values for rules
audit: use atomic_t to simplify audit_serial()
kernel/audit.c: use ARRAY_SIZE instead of sizeof/sizeof[0]
audit: reduce scope of audit_log_fcaps
audit: reduce scope of audit_net_id
audit: arm64: Remove the audit arch argument to audit_syscall_entry
arm64: audit: Add audit hook in syscall_trace_enter/exit()
audit: x86: drop arch from __audit_syscall_entry() interface
sparc: implement is_32bit_task
sparc: properly conditionalize use of TIF_32BIT
...
This splits syscall_trace_enter into syscall_trace_enter_phase1 and
syscall_trace_enter_phase2. Only phase 2 has full pt_regs, and only
phase 2 is permitted to modify any of pt_regs except for orig_ax.
The intent is that phase 1 can be called from the syscall fast path.
In this implementation, phase1 can handle any combination of
TIF_NOHZ (RCU context tracking), TIF_SECCOMP, and TIF_SYSCALL_AUDIT,
unless seccomp requests a ptrace event, in which case phase2 is
forced.
In principle, this could yield a big speedup for TIF_NOHZ as well as
for TIF_SECCOMP if syscall exit work were similarly split up.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2df320a600020fda055fccf2b668145729dd0c04.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The RCU context tracking code requires that arch code call
user_exit() on any entry into kernel code if TIF_NOHZ is set. This
patch adds a check for TIF_NOHZ and a comment to the syscall entry
tracing code.
The main purpose of this patch is to make the code easier to follow:
one can read the body of user_exit and of every function it calls
without finding any explanation of why it's called for traced
syscalls but not for untraced syscalls. This makes it clear when
user_exit() is necessary.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/0b13e0e24ec0307d67ab7a23b58764f6b1270116.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
is_compat_task() is the wrong check for audit arch; the check should
be is_ia32_task(): x32 syscalls should be AUDIT_ARCH_X86_64, not
AUDIT_ARCH_I386.
CONFIG_AUDITSYSCALL is currently incompatible with x32, so this has
no visible effect.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/a0138ed8c709882aec06e4acc30bfa9b623b8717.1409954077.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled. Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.
To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.
For ARM, this is a bit ugly due to the fact that ARM conditionally
supports seccomp filters. Fixing that would probably only be a
couple of lines of code, but it should be coordinated with the audit
maintainers.
This will be a slight slowdown on some arches. The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.
This is a prerequisite for making two-phase seccomp work cleanly.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
x86_64 uses a per_cpu variable kernel_stack to always point to
the thread stack of current. This is where the thread_info is stored
and is accessed from this location even when the irq or exception stack
is in use. This removes the complexity of having to maintain the
thread info on the stack when interrupts are running and having to
copy the preempt_count and other fields to the interrupt stack.
x86_32 uses the old method of copying the thread_info from the thread
stack to the exception stack just before executing the exception.
Having the two different requires #ifdefs and also the x86_32 way
is a bit of a pain to maintain. By converting x86_32 to the same
method of x86_64, we can remove #ifdefs, clean up the x86_32 code
a little, and remove the overhead of the copy.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20110806012354.263834829@goodmis.org
Link: http://lkml.kernel.org/r/20140206144321.852942014@goodmis.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The i386 thread_info contains a previous_esp field that is used
to daisy chain the different stacks for dump_stack()
(ie. irq, softirq, thread stacks).
The goal is to eventual make i386 handling of thread_info the same
as x86_64, which means that the thread_info will not be in the stack
but as a per_cpu variable. We will no longer depend on thread_info
being able to daisy chain different stacks as it will only exist
in one location (the thread stack).
By moving previous_esp to the end of thread_info and referencing
it as an offset instead of using a thread_info field, this becomes
a stepping stone to moving the thread_info.
The offset to get to the previous stack is rather ugly in this
patch, but this is only temporary and the prev_esp will be changed
in the next commit. This commit is more for sanity checks of the
change.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20110806012353.891757693@goodmis.org
Link: http://lkml.kernel.org/r/20140206144321.608754481@goodmis.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
ptrace_set_debugreg() is trivial but looks horrible. Kill the unnecessary
goto's and return's to cleanup the code.
This matches ptrace_get_debugreg() which also needs the trivial whitespace
cleanups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 24f1e32c60 ("hw-breakpoints: Rewrite the hw-breakpoints layer
on top of perf events") introduced the minor regression. Before this
commit
PTRACE_POKEUSER DR7, enableDR0
PTRACE_POKEUSER DR0, address
was perfectly valid, now PTRACE_POKEUSER(DR7) fails if DR0 was not
previously initialized by PTRACE_POKEUSER(DR0).
Change ptrace_write_dr7() to do ptrace_register_breakpoint(addr => 0) if
!bp && !disabled.
This fixes watchpoint-zeroaddr from ptrace-tests, see
https://bugzilla.redhat.com/show_bug.cgi?id=660204.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes, preparation.
Extract the "register breakpoint" code from ptrace_get_debugreg() into
the new/generic helper, ptrace_register_breakpoint(). It will have more
users.
The patch also adds another simple helper, ptrace_fill_bp_fields(), to
factor out the arch_bp_generic_fields() logic in register/modify.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_write_dr7() skips ptrace_modify_breakpoint(disabled => true)
unless second_pass, this buys nothing but complicates the code and means
that we always do the main loop twice even if "disabled" was never true.
The comment says:
Don't unregister the breakpoints right-away,
unless all register_user_hw_breakpoint()
requests have succeeded.
Firstly, we do not do register_user_hw_breakpoint(), it was removed by
commit 24f1e32c60 ("hw-breakpoints: Rewrite the hw-breakpoints layer
on top of perf events").
We are going to restore register_user_hw_breakpoint() (see the next
patch) but this doesn't matter: after commit 44234adcdc
("hw-breakpoints: Modify breakpoints without unregistering them")
perf_event_disable() can not hurt, hw_breakpoint_del() does not free the
slot.
Remove the "second_pass" check from the main loop and simplify the code.
Since we have to check "bp != NULL" anyway, the patch also removes the
same check in ptrace_modify_breakpoint() and moves the comment into
ptrace_write_dr7().
With this patch the second pass is only needed to restore the saved
old_dr7. This should never fail, so the patch adds WARN_ON() to catch
the potential problems as Frederic suggested.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_write_dr7() looks unnecessarily overcomplicated. We can factor
out ptrace_modify_breakpoint() and do not do "continue" twice, just we
need to pass the proper "disabled" argument to
ptrace_modify_breakpoint().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 87dc669ba2 ("hw_breakpoints: Fix racy access to
ptrace breakpoints").
The patch was fine but we can no longer race with SIGKILL after commit
9899d11f65 ("ptrace: ensure arch_ptrace/ptrace_request can never race
with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
->ptrace_bps[] can't go away.
The patch only removes ptrace_get_breakpoints/ptrace_put_breakpoints and
does a couple of "while at it" cleanups, it doesn't remove other changes
from the reverted commit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit cb57a2b4cf ("x86-32: Export
kernel_stack_pointer() for modules") added an include of the
module.h header in conjunction with adding an EXPORT_SYMBOL_GPL
of kernel_stack_pointer.
But module.h should be avoided for simple exports, since it in turn
includes the world. Swap the module.h for export.h instead.
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1360872842-28417-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Conflicts:
arch/x86/kernel/ptrace.c
Pull the latest RCU tree from Paul E. McKenney:
" The major features of this series are:
1. A first version of no-callbacks CPUs. This version prohibits
offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y.
Relaxing this constraint is in progress, but not yet ready
for prime time. These commits were posted to LKML at
https://lkml.org/lkml/2012/10/30/724, and are at branch rcu/nocb.
2. Changes to SRCU that allows statically initialized srcu_struct
structures. These commits were posted to LKML at
https://lkml.org/lkml/2012/10/30/296, and are at branch rcu/srcu.
3. Restructuring of RCU's debugfs output. These commits were posted
to LKML at https://lkml.org/lkml/2012/10/30/341, and are at
branch rcu/tracing.
4. Additional CPU-hotplug/RCU improvements, posted to LKML at
https://lkml.org/lkml/2012/10/30/327, and are at branch rcu/hotplug.
Note that the commit eliminating __stop_machine() was judged to
be too-high of risk, so is deferred to 3.9.
5. Changes to RCU's idle interface, most notably a new module
parameter that redirects normal grace-period operations to
their expedited equivalents. These were posted to LKML at
https://lkml.org/lkml/2012/10/30/739, and are at branch rcu/idle.
6. Additional diagnostics for RCU's CPU stall warning facility,
posted to LKML at https://lkml.org/lkml/2012/10/30/315, and
are at branch rcu/stall. The most notable change reduces the
default RCU CPU stall-warning time from 60 seconds to 21 seconds,
so that it once again happens sooner than the softlockup timeout.
7. Documentation updates, which were posted to LKML at
https://lkml.org/lkml/2012/10/30/280, and are at branch rcu/doc.
A couple of late-breaking changes were posted at
https://lkml.org/lkml/2012/11/16/634 and
https://lkml.org/lkml/2012/11/16/547.
8. Miscellaneous fixes, which were posted to LKML at
https://lkml.org/lkml/2012/10/30/309, along with a late-breaking
change posted at Fri, 16 Nov 2012 11:26:25 -0800 with message-ID
<20121116192625.GA447@linux.vnet.ibm.com>, but which lkml.org
seems to have missed. These are at branch rcu/fixes.
9. Finally, a fix for an lockdep-RCU splat was posted to LKML
at https://lkml.org/lkml/2012/11/7/486. This is at rcu/next. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU fix from Ingo Molnar:
"Fix leaking RCU extended quiescent state, which might trigger warnings
and mess up the extended quiescent state tracking logic into thinking
that we are in "RCU user mode" while we aren't."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Fix unrecovered RCU user mode in syscall_trace_leave()
Create a new subsystem that probes on kernel boundaries
to keep track of the transitions between level contexts
with two basic initial contexts: user or kernel.
This is an abstraction of some RCU code that use such tracking
to implement its userspace extended quiescent state.
We need to pull this up from RCU into this new level of indirection
because this tracking is also going to be used to implement an "on
demand" generic virtual cputime accounting. A necessary step to
shutdown the tick while still accounting the cputime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: fix whitespace error and email address. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Modules, in particular oprofile (and possibly other similar tools)
need kernel_stack_pointer(), so export it using EXPORT_SYMBOL_GPL().
Cc: Yang Wei <wei.yang@windriver.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Jun Zhang <jun.zhang@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20120912135059.GZ8285@erda.amd.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On x86-64 syscall exit, 3 non exclusive events may happen
looping in the following order:
1) Check if we need resched for user preemption, if so call
schedule_user()
2) Check if we have pending signals, if so call do_notify_resume()
3) Check if we do syscall tracing, if so call syscall_trace_leave()
However syscall_trace_leave() has been written assuming it directly
follows the syscall and forget about the above possible 1st and 2nd
steps.
Now schedule_user() and do_notify_resume() exit in RCU user mode
because they have most chances to resume userspace immediately and
this avoids an rcu_user_enter() call in the syscall fast path.
So by the time we call syscall_trace_leave(), we may well be in RCU
user mode. To fix this up, simply call rcu_user_exit() in the beginning
of this function.
This fixes some reported RCU uses in extended quiescent state.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull x86/fpu update from Ingo Molnar:
"The biggest change is the addition of the non-lazy (eager) FPU saving
support model and enabling it on CPUs with optimized xsaveopt/xrstor
FPU state saving instructions.
There are also various Sparse fixes"
Fix up trivial add-add conflict in arch/x86/kernel/traps.c
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
x86, fpu: remove cpu_has_xmm check in the fx_finit()
x86, fpu: make eagerfpu= boot param tri-state
x86, fpu: enable eagerfpu by default for xsaveopt
x86, fpu: decouple non-lazy/eager fpu restore from xsave
x86, fpu: use non-lazy fpu restore for processors supporting xsave
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
x86, fpu: drop_fpu() before restoring new state from sigframe
x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
x86, signal: Cleanup ifdefs and is_ia32, is_x32
Add syscall slow path hooks to notify syscall entry
and exit on CPUs that want to support userspace RCU
extended quiescent state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied
to/from the fpstate in the task struct.
And in the case of signal delivery for x86_64 binaries, if the fpstate is live
in the CPU registers, then the live state is copied directly to the user
sigframe. Otherwise fpstate in the task struct is copied to the user sigframe.
During restore, fpstate in the user sigframe is restored directly to the live
CPU registers.
Historically, different code paths led to different bugs. For example,
x86_64 code path was not preemption safe till recently. Also there is lot
of code duplication for support of new features like xsave etc.
Unify signal handling code paths for x86 and x86_64 kernels.
New strategy is as follows:
Signal delivery: Both for 32/64-bit frames, align the core math frame area to
64bytes as needed by xsave (this where the main fpu/extended state gets copied
to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave
frames). If the state is live, copy the register state directly to the user
frame. If not live, copy the state in the thread struct to the user frame. And
for 32-bit [f]xsave frames, construct the fsave header separately before
the actual [f]xsave area.
Signal return: As the 32-bit frames with [f]xstate has an additional
'fsave' header, copy everything back from the user sigframe to the
fpstate in the task structure and reconstruct the fxstate from the 'fsave'
header (Also user passed pointers may not be correctly aligned for
any attempt to directly restore any partial state). At the next fpstate usage,
everything will be restored to the live CPU registers.
For all the 64-bit frames and the 32-bit fsave frame, restore the state from
the user sigframe directly to the live CPU registers. 64-bit signals always
restored the math frame directly, so we can expect the math frame pointer
to be correctly aligned. For 32-bit fsave frames, there are no alignment
requirements, so we can restore the state directly.
"lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are
with in the noise range with this change.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.siddha@intel.com
[ Merged in compilation fix ]
Link: http://lkml.kernel.org/r/1344544736.8326.17.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When I added x32 ptrace to 3.4 kernel, I also include PTRACE_ARCH_PRCTL
support for x32 GDB For ARCH_GET_FS/GS, it takes a pointer to int64. But
at user level, ARCH_GET_FS/GS takes a pointer to int32. So I have to add
x32 ptrace to glibc to handle it with a temporary int64 passed to kernel and
copy it back to GDB as int32. Roland suggested that PTRACE_ARCH_PRCTL
is obsolete and x32 GDB should use fs_base and gs_base fields of
user_regs_struct instead.
Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to
avoid possible memory overrun when pointer to int32 is passed to
kernel.
Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org> v3.4
Enable support for seccomp filter on x86:
- syscall_get_arch()
- syscall_get_arguments()
- syscall_rollback()
- syscall_set_return_value()
- SIGSYS siginfo_t support
- secure_computing is called from a ptrace_event()-safe context
- secure_computing return value is checked (see below).
SECCOMP_RET_TRACE and SECCOMP_RET_TRAP may result in seccomp needing to
skip a system call without killing the process. This is done by
returning a non-zero (-1) value from secure_computing. This change
makes x86 respect that return value.
To ensure that minimal kernel code is exposed, a non-zero return value
results in an immediate return to user space (with an invalid syscall
number).
Signed-off-by: Will Drewry <wad@chromium.org>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
v18: rebase and tweaked change description, acked-by
v17: added reviewed by and rebased
v..: all rebases since original introduction.
Signed-off-by: James Morris <james.l.morris@oracle.com>
Pull x86 cleanups from Peter Anvin:
"The biggest textual change is the cleanup to use symbolic constants
for x86 trap values.
The only *functional* change and the reason for the x86/x32 dependency
is the move of is_ia32_task() into <asm/thread_info.h> so that it can
be used in other code that needs to understand if a system call comes
from the compat entry point (and therefore uses i386 system call
numbers) or not. One intended user for that is the BPF system call
filter. Moving it out of <asm/compat.h> means we can define it
unconditionally, returning always true on i386."
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move is_ia32_task to asm/thread_info.h from asm/compat.h
x86: Rename trap_no to trap_nr in thread_struct
x86: Use enum instead of literals for trap values
Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.
This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."
Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
...
There are precedences of trap number being referred to as
trap_nr. However thread struct refers trap number as trap_no.
Change it to trap_nr.
Also use enum instead of left-over literals for trap values.
This is pure cleanup, no functional change intended.
Suggested-by: Ingo Molnar <mingo@eltu.hu>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092555.5379.942.sendpatchset@srdronam.in.ibm.com
[ Fixed the math-emu build ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
X32 ptrace is a hybrid of 64bit ptrace and compat ptrace with 32bit
address and longs. It use 64bit ptrace to access the full 64bit
registers. PTRACE_PEEKUSR and PTRACE_POKEUSR are only allowed to access
segment and debug registers. PTRACE_PEEKUSR returns the lower 32bits
and PTRACE_POKEUSR zero-extends 32bit value to 64bit. It works since
the upper 32bits of segment and debug registers of x32 process are always
zero. GDB only uses PTRACE_PEEKUSR and PTRACE_POKEUSR to access
segment and debug registers.
[ hpa: changed TIF_X32 test to use !is_ia32_task() instead, and moved
the system call number to the now-unused 521 slot. ]
Signed-off-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.
So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.
The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.
The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module. But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained. Even if it
isn't perhaps as contained as it _could_ be.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Every arch calls:
if (unlikely(current->audit_context))
audit_syscall_entry()
which requires knowledge about audit (the existance of audit_context) in
the arch code. Just do it all in static inline in audit.h so that arch's
can remain blissfully ignorant.
Signed-off-by: Eric Paris <eparis@redhat.com>
The audit system previously expected arches calling to audit_syscall_exit to
supply as arguments if the syscall was a success and what the return code was.
Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
by converting from negative retcodes to an audit internal magic value stating
success or failure. This helper was wrong and could indicate that a valid
pointer returned to userspace was a failed syscall. The fix is to fix the
layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it
in turns calls back into arch code to collect the return value and to
determine if the syscall was a success or failure. We also define a generic
is_syscall_success() macro which determines success/failure based on if the
value is < -MAX_ERRNO. This works for arches like x86 which do not use a
separate mechanism to indicate syscall failure.
We make both the is_syscall_success() and regs_return_value() static inlines
instead of macros. The reason is because the audit function must take a void*
for the regs. (uml calls theirs struct uml_pt_regs instead of just struct
pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit
function takes a void* we need to use static inlines to cast it back to the
arch correct structure to dereference it.
The other major change is that on some arches, like ia64, MIPS and ppc, we
change regs_return_value() to give us the negative value on syscall failure.
THE only other user of this macro, kretprobe_example.c, won't notice and it
makes the value signed consistently for the audit functions across all archs.
In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
audit code as the return value. But the ptrace_64.h code defined the macro
regs_return_value() as regs[3]. I have no idea which one is correct, but this
patch now uses the regs_return_value() function, so it now uses regs[3].
For powerpc we previously used regs->result but now use the
regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is
always positive so the regs_return_value(), much like ia64 makes it negative
before calling the audit code when appropriate.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
Acked-by: Richard Weinberger <richard@nod.at> [for uml]
Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
ptrace_set_debugreg() is only used in this file and should be
static. This also quiets the following sparse warning:
warning: symbol 'ptrace_set_debugreg' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: hartleys@visionengravers.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure. This is ugly and doesn't scale if a
single callback services many perf_events.
Fix by adding a context parameter to perf_event_create_kernel_counter()
(and derived hardware breakpoints APIs) and storing it in the perf_event.
The field can be accessed from the callback as event->overflow_handler_context.
All callers are updated.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.
For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.
As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While the tracer accesses ptrace breakpoints, the child task may
concurrently exit due to a SIGKILL and thus release its breakpoints
at the same time. We can then dereference some freed pointers.
To fix this, hold a reference on the child breakpoints before
manipulating them.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: v2.6.33.. <stable@kernel.org>
Link: http://lkml.kernel.org/r/1302284067-7860-3-git-send-email-fweisbec@gmail.com
Remove checking @addr less than 0 because @addr is now unsigned and
use new udescp variable in order to remove unnecessary castings.
[akpm@linux-foundation.org: fix unused variable 'udescp']
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up the arguments to arch_ptrace() to take account of the fact that
@addr and @data are now unsigned long rather than long as of a preceding
patch in this series.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tag ptrace breakpoints with the exclude_kernel attribute set. This
will make it easier to set generic policies on breakpoints, when it
comes to ensure nobody unpriviliged try to breakpoint on the kernel.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Support for the PMU's BTS features has been upstreamed in
v2.6.32, but we still have the old and disabled ptrace-BTS,
as Linus noticed it not so long ago.
It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
regard for other uses (perf) and doesn't provide the flexibility
needed for perf either.
Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
was never used and ptrace-block-step can be implemented using a
much simpler approach.
So axe all 3000 lines of it. That includes the *locked_memory*()
APIs in mm/mlock.c as well.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100325135413.938004390@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>