License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX license identifier to apply to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0                                                11139
and resulted in the first patch in this series.
If that file was on a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                          930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                              # files
   ----------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                          270
   GPL-2.0+ WITH Linux-syscall-note                         169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
   LGPL-2.1+ WITH Linux-syscall-note                         15
   GPL-1.0+ WITH Linux-syscall-note                          14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
   LGPL-2.0+ WITH Linux-syscall-note                          4
   LGPL-2.1 WITH Linux-syscall-note                           3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
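The first heuristic above (neither scanner found any license traces) reduces to a path check. A minimal C sketch, assuming a plain substring test is an adequate stand-in for the */uapi/* glob; `is_uapi_path` and `spdx_for_unlicensed_file` are hypothetical names and are not taken from the tooling described in this message:

```c
#include <stdbool.h>
#include <string.h>

/* Assumption: "contains /uapi/" approximates the glob used in the message. */
static bool is_uapi_path(const char *path)
{
	return strstr(path, "/uapi/") != NULL;
}

/*
 * Hypothetical helper: identifier for a file in which no license
 * information was detected, so the top-level COPYING license applies.
 */
static const char *spdx_for_unlicensed_file(const char *path)
{
	return is_uapi_path(path) ? "GPL-2.0 WITH Linux-syscall-note"
				  : "GPL-2.0";
}
```

Files with existing license text are not covered by this sketch; per the list above, those went through scanner comparison and manual review.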
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based in part on an older version of FOSSology, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
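The script itself is not shown in this message. Purely as an illustration of the "different comment types" remark: the kernel's licensing rules put the tag in a `//` comment in .c files and in a `/* */` comment in headers and assembly files. The helper names below are hypothetical, and the sketch only distinguishes .c files from everything else:

```c
#include <stdio.h>
#include <string.h>

/* Returns nonzero when s ends with suffix. */
static int has_suffix(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && strcmp(s + n - m, suffix) == 0;
}

/*
 * Hypothetical formatter: writes the SPDX tag line for path into buf
 * and returns the snprintf() result.
 */
static int format_spdx_line(char *buf, size_t len, const char *path,
			    const char *expr)
{
	if (has_suffix(path, ".c"))
		return snprintf(buf, len, "// SPDX-License-Identifier: %s", expr);
	return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", expr);
}
```

For an assembly file such as entry_32.S this produces the same form as the first line of the file.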
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (C) 1991,1992 Linus Torvalds
 *
 * entry_32.S contains the system-call and low-level fault and trap handling routines.
 *
 * Stack layout while running C code:
 *	ptrace needs to have all registers on the stack.
 *	If the order here is changed, it needs to be
 *	updated in fork.c:copy_process(), signal.c:do_signal(),
 *	ptrace.c and ptrace.h
 *
 *	 0(%esp) - %ebx
 *	 4(%esp) - %ecx
 *	 8(%esp) - %edx
 *	 C(%esp) - %esi
 *	10(%esp) - %edi
 *	14(%esp) - %ebp
 *	18(%esp) - %eax
 *	1C(%esp) - %ds
 *	20(%esp) - %es
 *	24(%esp) - %fs
x86/stackprotector/32: Make the canary into a regular percpu variable
On 32-bit kernels, the stackprotector canary is quite nasty -- it is
stored at %gs:(20), which is nasty because 32-bit kernels use %fs for
percpu storage. It's even nastier because it means that whether %gs
contains userspace state or kernel state while running kernel code
depends on whether stackprotector is enabled (this is
CONFIG_X86_32_LAZY_GS), and this setting radically changes the way
that segment selectors work. Supporting both variants is a
maintenance and testing mess.
Merely rearranging so that percpu and the stack canary
share the same segment would be messy as the 32-bit percpu address
layout isn't currently compatible with putting a variable at a fixed
offset.
Fortunately, GCC 8.1 added options that allow the stack canary to be
accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary
percpu variable. This lets us get rid of all of the code to manage the
stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess.
(That name is special. We could use any symbol we want for the
%fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any
name other than __stack_chk_guard.)
Forcibly disable stackprotector on older compilers that don't support
the new options and turn the stack canary into a percpu variable. The
"lazy GS" approach is now used for all 32-bit configurations.
Also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels,
it loads the GS selector and updates the user GSBASE accordingly. (This
is unchanged.) On 32-bit kernels, it loads the GS selector and updates
GSBASE, which is now always the user base. This means that the overall
effect is the same on 32-bit and 64-bit, which avoids some ifdeffery.
[ bp: Massage commit message. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org
2021-02-14 03:19:44 +08:00
 *	28(%esp) - unused -- was %gs on old stackprotector kernels
 *	2C(%esp) - orig_eax
 *	30(%esp) - %eip
 *	34(%esp) - %cs
 *	38(%esp) - %eflags
 *	3C(%esp) - %oldesp
 *	40(%esp) - %oldss
 */

#include <linux/linkage.h>
#include <linux/err.h>
#include <asm/thread_info.h>
#include <asm/irqflags.h>
#include <asm/errno.h>
#include <asm/segment.h>
#include <asm/smp.h>
#include <asm/percpu.h>
#include <asm/processor-flags.h>
#include <asm/irq_vectors.h>
#include <asm/cpufeatures.h>
#include <asm/alternative.h>
#include <asm/asm.h>
#include <asm/smap.h>
#include <asm/frame.h>
#include <asm/trapnr.h>
#include <asm/nospec-branch.h>

#include "calling.h"

	.section .entry.text, "ax"

#define PTI_SWITCH_MASK (1 << PAGE_SHIFT)

/* Unconditionally switch to user cr3 */
.macro SWITCH_TO_USER_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI

	movl %cr3, \scratch_reg
	orl $PTI_SWITCH_MASK, \scratch_reg
	movl \scratch_reg, %cr3
.Lend_\@:
.endm

.macro BUG_IF_WRONG_CR3 no_user_check=0
#ifdef CONFIG_DEBUG_ENTRY
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	.if \no_user_check == 0
	/* coming from usermode? */
	testl $USER_SEGMENT_RPL_MASK, PT_CS(%esp)
	jz .Lend_\@
	.endif
	/* On user-cr3? */
	movl %cr3, %eax
	testl $PTI_SWITCH_MASK, %eax
	jnz .Lend_\@
	/* From userspace with kernel cr3 - BUG */
	ud2
.Lend_\@:
#endif
.endm

/*
 * Switch to kernel cr3 if not already loaded and return current cr3 in
 * \scratch_reg
 */
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	movl %cr3, \scratch_reg
	/* Test if we are already on kernel CR3 */
	testl $PTI_SWITCH_MASK, \scratch_reg
	jz .Lend_\@
	andl $(~PTI_SWITCH_MASK), \scratch_reg
	movl \scratch_reg, %cr3
	/* Return original CR3 in \scratch_reg */
	orl $PTI_SWITCH_MASK, \scratch_reg
.Lend_\@:
.endm
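
The two CR3 macros above differ only in one bit of the register value. The following C model is a sketch, not kernel code: it assumes PTI is enabled (the ALTERNATIVE jump is not taken), assumes PAGE_SHIFT is 12, and uses a plain variable `cr3` as a stand-in for the hardware register. The function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PTI_SWITCH_MASK	(UINT32_C(1) << PAGE_SHIFT)

static uint32_t cr3;	/* stand-in for the hardware register */

/* Models SWITCH_TO_USER_CR3: unconditionally select the user value. */
static void switch_to_user_cr3(void)
{
	cr3 |= PTI_SWITCH_MASK;
}

/*
 * Models SWITCH_TO_KERNEL_CR3: clears the switch bit only when it is
 * set, and returns the CR3 value that was live on entry.
 */
static uint32_t switch_to_kernel_cr3(void)
{
	uint32_t entry = cr3;

	if (entry & PTI_SWITCH_MASK)
		cr3 = entry & ~PTI_SWITCH_MASK;
	return entry;
}
```

Returning the entry value is what lets callers such as SAVE_ALL_NMI keep the old CR3 in a register and restore it later.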

#define CS_FROM_ENTRY_STACK (1 << 31)
#define CS_FROM_USER_CR3 (1 << 30)
#define CS_FROM_KERNEL (1 << 29)
#define CS_FROM_ESPFIX (1 << 28)

.macro FIXUP_FRAME
	/*
	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
	 * Clear them in case hardware didn't do this for us.
	 */
	andl $0x0000ffff, 4*4(%esp)

#ifdef CONFIG_VM86
	testl $X86_EFLAGS_VM, 5*4(%esp)
	jnz .Lfrom_usermode_no_fixup_\@
#endif
	testl $USER_SEGMENT_RPL_MASK, 4*4(%esp)
	jnz .Lfrom_usermode_no_fixup_\@

	orl $CS_FROM_KERNEL, 4*4(%esp)

	/*
	 * When we're here from kernel mode; the (exception) stack looks like:
	 *
	 *	 6*4(%esp) - <previous context>
	 *	 5*4(%esp) - flags
	 *	 4*4(%esp) - cs
	 *	 3*4(%esp) - ip
	 *	 2*4(%esp) - orig_eax
	 *	 1*4(%esp) - gs / function
	 *	 0*4(%esp) - fs
	 *
	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
	 * is complete and in particular regs->sp is correct. This gives us
	 * the original 6 entries as gap:
	 *
	 *	14*4(%esp) - <previous context>
	 *	13*4(%esp) - gap / flags
	 *	12*4(%esp) - gap / cs
	 *	11*4(%esp) - gap / ip
	 *	10*4(%esp) - gap / orig_eax
	 *	 9*4(%esp) - gap / gs / function
	 *	 8*4(%esp) - gap / fs
	 *	 7*4(%esp) - ss
	 *	 6*4(%esp) - sp
	 *	 5*4(%esp) - flags
	 *	 4*4(%esp) - cs
	 *	 3*4(%esp) - ip
	 *	 2*4(%esp) - orig_eax
	 *	 1*4(%esp) - gs / function
	 *	 0*4(%esp) - fs
	 */

	pushl %ss # ss
	pushl %esp # sp (points at ss)
	addl $7*4, (%esp) # point sp back at the previous context
	pushl 7*4(%esp) # flags
	pushl 7*4(%esp) # cs
	pushl 7*4(%esp) # ip
	pushl 7*4(%esp) # orig_eax
	pushl 7*4(%esp) # gs / function
	pushl 7*4(%esp) # fs
.Lfrom_usermode_no_fixup_\@:
.endm

.macro IRET_FRAME
	/*
	 * We're called with %ds, %es, %fs, and %gs from the interrupted
	 * frame, so we shouldn't use them. Also, we may be in ESPFIX
	 * mode and therefore have a nonzero SS base and an offset ESP,
	 * so any attempt to access the stack needs to use SS. (except for
	 * accesses through %esp, which automatically use SS.)
	 */
	testl $CS_FROM_KERNEL, 1*4(%esp)
	jz .Lfinished_frame_\@

	/*
	 * Reconstruct the 3 entry IRET frame right after the (modified)
	 * regs->sp without lowering %esp in between, such that an NMI in the
	 * middle doesn't scribble our stack.
	 */
	pushl %eax
	pushl %ecx
	movl 5*4(%esp), %eax # (modified) regs->sp

	movl 4*4(%esp), %ecx # flags
	movl %ecx, %ss:-1*4(%eax)

	movl 3*4(%esp), %ecx # cs
	andl $0x0000ffff, %ecx
	movl %ecx, %ss:-2*4(%eax)

	movl 2*4(%esp), %ecx # ip
	movl %ecx, %ss:-3*4(%eax)

	movl 1*4(%esp), %ecx # eax
	movl %ecx, %ss:-4*4(%eax)

	popl %ecx
	lea -4*4(%eax), %esp
	popl %eax
.Lfinished_frame_\@:
.endm
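
FIXUP_FRAME and IRET_FRAME above both rely on the CS slot being a full dword of which only the low 16 bits hold the selector, so the upper bits can carry the CS_FROM_* markers. A small C model of that bookkeeping, offered as a sketch: `RPL_MASK` is an assumed stand-in for USER_SEGMENT_RPL_MASK (whose value is not shown in this file), the VM86 check is left out, and the function names are hypothetical.

```c
#include <stdint.h>

#define CS_FROM_ENTRY_STACK	(UINT32_C(1) << 31)
#define CS_FROM_USER_CR3	(UINT32_C(1) << 30)
#define CS_FROM_KERNEL		(UINT32_C(1) << 29)
#define CS_FROM_ESPFIX		(UINT32_C(1) << 28)

#define RPL_MASK	UINT32_C(0x3)	/* assumption, see lead-in */

/* Models the first steps of FIXUP_FRAME on the saved CS dword. */
static uint32_t fixup_cs_dword(uint32_t cs_dword)
{
	cs_dword &= UINT32_C(0x0000ffff);	/* drop stale high bits */
	if ((cs_dword & RPL_MASK) == 0)		/* interrupted kernel code */
		cs_dword |= CS_FROM_KERNEL;
	return cs_dword;
}

/* Models the mask IRET_FRAME applies before writing CS back. */
static uint16_t selector_of(uint32_t cs_dword)
{
	return (uint16_t)(cs_dword & UINT32_C(0x0000ffff));
}
```

The selector values used to exercise this model are arbitrary examples; any selector with RPL 0 takes the kernel branch.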

.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 unwind_espfix=0
	cld
	.if \skip_gs == 0
	pushl $0
	.endif
	pushl %fs

	pushl %eax
	movl $(__KERNEL_PERCPU), %eax
	movl %eax, %fs
	.if \unwind_espfix > 0
	UNWIND_ESPFIX_STACK
	.endif
	popl %eax

	FIXUP_FRAME
	pushl %es
	pushl %ds
	pushl \pt_regs_ax
	pushl %ebp
	pushl %edi
	pushl %esi
	pushl %edx
	pushl %ecx
	pushl %ebx
	movl $(__USER_DS), %edx
	movl %edx, %ds
	movl %edx, %es
	/* Switch to kernel stack if necessary */
	.if \switch_stacks > 0
	SWITCH_TO_KERNEL_STACK
	.endif
.endm

.macro SAVE_ALL_NMI cr3_reg:req unwind_espfix=0
	SAVE_ALL unwind_espfix=\unwind_espfix

	BUG_IF_WRONG_CR3

	/*
	 * Now switch the CR3 when PTI is enabled.
	 *
	 * We can enter with either user or kernel cr3, the code will
	 * store the old cr3 in \cr3_reg and switches to the kernel cr3
	 * if necessary.
	 */
	SWITCH_TO_KERNEL_CR3 scratch_reg=\cr3_reg

.Lend_\@:
.endm

.macro RESTORE_INT_REGS
	popl %ebx
	popl %ecx
	popl %edx
	popl %esi
	popl %edi
	popl %ebp
	popl %eax
.endm

.macro RESTORE_REGS pop=0
	RESTORE_INT_REGS
1:	popl %ds
2:	popl %es
3:	popl %fs
	addl $(4 + \pop), %esp /* pop the unused "gs" slot */
	IRET_FRAME
.pushsection .fixup, "ax"
4:	movl $0, (%esp)
	jmp 1b
5:	movl $0, (%esp)
	jmp 2b
6:	movl $0, (%esp)
	jmp 3b
[PATCH] i386: Use %gs as the PDA base-segment in the kernel
This patch is the meat of the PDA change. This patch makes several related
changes:
1: Most significantly, %gs is now used in the kernel. This means that on
entry, the old value of %gs is saved away, and it is reloaded with
__KERNEL_PDA.
2: entry.S constructs the stack in the shape of struct pt_regs, and this
is passed around the kernel so that the process's saved register
state can be accessed.
Unfortunately struct pt_regs doesn't currently have space for %gs
(or %fs). This patch extends pt_regs to add space for gs (no space
is allocated for %fs, since it won't be used, and it would just
complicate the code in entry.S to work around the space).
3: Because %gs is now saved on the stack like %ds, %es and the integer
registers, there are a number of places where it no longer needs to
be handled specially; namely context switch, and saving/restoring the
register state in a signal context.
4: And since kernel threads run in kernel space and call normal kernel
code, they need to be created with their %gs == __KERNEL_PDA.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07 09:14:02 +08:00
.popsection
	_ASM_EXTABLE(1b, 4b)
	_ASM_EXTABLE(2b, 5b)
	_ASM_EXTABLE(3b, 6b)
.endm

.macro RESTORE_ALL_NMI cr3_reg:req pop=0
	/*
	 * Now switch the CR3 when PTI is enabled.
	 *
	 * We enter with kernel cr3 and switch the cr3 to the value
	 * stored on \cr3_reg, which is either a user or a kernel cr3.
	 */
	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI

	testl $PTI_SWITCH_MASK, \cr3_reg
	jz .Lswitched_\@

	/* User cr3 in \cr3_reg - write it to hardware cr3 */
	movl \cr3_reg, %cr3

.Lswitched_\@:

	BUG_IF_WRONG_CR3

	RESTORE_REGS pop=\pop
.endm

.macro CHECK_AND_APPLY_ESPFIX
#ifdef CONFIG_X86_ESPFIX32
#define GDT_ESPFIX_OFFSET (GDT_ENTRY_ESPFIX_SS * 8)
#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + GDT_ESPFIX_OFFSET

	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_ESPFIX

	movl PT_EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
	/*
	 * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
	 * are returning to the kernel.
	 * See comments in process.c:copy_thread() for details.
	 */
	movb PT_OLDSS(%esp), %ah
	movb PT_CS(%esp), %al
	andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
	cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
	jne .Lend_\@ # returning to user-space with LDT SS

	/*
	 * Setup and switch to ESPFIX stack
	 *
	 * We're returning to userspace with a 16 bit stack. The CPU will not
	 * restore the high word of ESP for us on executing iret... This is an
	 * "official" bug of all the x86-compatible CPUs, which we can work
	 * around to make dosemu and wine happy. We do this by preloading the
	 * high word of ESP with the high word of the userspace ESP while
	 * compensating for the offset by changing to the ESPFIX segment with
	 * a base address that matches for the difference.
	 */
	mov %esp, %edx /* load kernel esp */
	mov PT_OLDESP(%esp), %eax /* load userspace esp */
	mov %dx, %ax /* eax: new kernel esp */
	sub %eax, %edx /* offset (low word is 0) */
	shr $16, %edx
	mov %dl, GDT_ESPFIX_SS + 4 /* bits 16..23 */
	mov %dh, GDT_ESPFIX_SS + 7 /* bits 24..31 */
	pushl $__ESPFIX_SS
	pushl %eax /* new kernel esp */
	/*
	 * Disable interrupts, but do not irqtrace this section: we
	 * will soon execute iret and the tracer was already set to
	 * the irqstate after the IRET:
	 */
	cli
	lss (%esp), %esp /* switch to espfix segment */
.Lend_\@:
#endif /* CONFIG_X86_ESPFIX32 */
.endm
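
The arithmetic in the ESPFIX setup above can be checked in isolation. The C model below is a sketch of just the register computation: it assumes 32-bit wraparound arithmetic, and the struct and function names are hypothetical. It splices the high word of the user ESP onto the low word of the kernel ESP, then derives the segment base that makes base + new ESP equal the original kernel ESP.

```c
#include <stdint.h>

struct espfix_setup {
	uint32_t new_esp;	/* value that lss loads into %esp */
	uint8_t base_16_23;	/* written to descriptor byte 4 */
	uint8_t base_24_31;	/* written to descriptor byte 7 */
};

static struct espfix_setup espfix_compute(uint32_t kernel_esp,
					  uint32_t user_esp)
{
	struct espfix_setup s;
	uint32_t base;

	/* mov %dx, %ax: user high word, kernel low word */
	s.new_esp = (user_esp & UINT32_C(0xffff0000)) |
		    (kernel_esp & UINT32_C(0x0000ffff));
	/* sub %eax, %edx: the low 16 bits cancel out */
	base = (uint32_t)(kernel_esp - s.new_esp);
	s.base_16_23 = (uint8_t)(base >> 16);
	s.base_24_31 = (uint8_t)(base >> 24);
	return s;
}
```

Because the low word of the difference is always zero, only the two upper base bytes of the descriptor need to be rewritten, which matches the two byte stores in the macro.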

/*
 * Called with pt_regs fully populated and kernel segments loaded,
 * so we can access PER_CPU and use the integer registers.
 *
 * We need to be very careful here with the %esp switch, because an NMI
 * can happen everywhere. If the NMI handler finds itself on the
 * entry-stack, it will overwrite the task-stack and everything we
 * copied there. So allocate the stack-frame on the task-stack and
 * switch to it before we do any copying.
 */

.macro SWITCH_TO_KERNEL_STACK

	BUG_IF_WRONG_CR3

	SWITCH_TO_KERNEL_CR3 scratch_reg=%eax

	/*
	 * %eax now contains the entry cr3 and we carry it forward in
	 * that register for the time this macro runs
	 */

	/* Are we on the entry stack? Bail out if not! */
	movl PER_CPU_VAR(cpu_entry_area), %ecx
	addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
	subl %esp, %ecx /* ecx = (end of entry_stack) - esp */
	cmpl $SIZEOF_entry_stack, %ecx
	jae .Lend_\@

	/* Load stack pointer into %esi and %edi */
	movl %esp, %esi
	movl %esi, %edi

	/* Move %edi to the top of the entry stack */
	andl $(MASK_entry_stack), %edi
	addl $(SIZEOF_entry_stack), %edi

	/* Load top of task-stack into %edi */
	movl TSS_entry2task_stack(%edi), %edi

	/* Special case - entry from kernel mode via entry stack */
#ifdef CONFIG_VM86
	movl PT_EFLAGS(%esp), %ecx # mix EFLAGS and CS
	movb PT_CS(%esp), %cl
	andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
#else
	movl PT_CS(%esp), %ecx
	andl $SEGMENT_RPL_MASK, %ecx
#endif
	cmpl $USER_RPL, %ecx
	jb .Lentry_from_kernel_\@

	/* Bytes to copy */
	movl $PTREGS_SIZE, %ecx

#ifdef CONFIG_VM86
	testl $X86_EFLAGS_VM, PT_EFLAGS(%esi)
	jz .Lcopy_pt_regs_\@

	/*
	 * Stack-frame contains 4 additional segment registers when
	 * coming from VM86 mode
	 */
	addl $(4 * 4), %ecx

#endif
.Lcopy_pt_regs_\@:

	/* Allocate frame on task-stack */
	subl %ecx, %edi

	/* Switch to task-stack */
	movl %edi, %esp

	/*
	 * We are now on the task-stack and can safely copy over the
	 * stack-frame
	 */
	shrl $2, %ecx
	cld
	rep movsl

	jmp .Lend_\@

.Lentry_from_kernel_\@:

	/*
	 * This handles the case when we enter the kernel from
	 * kernel-mode and %esp points to the entry-stack. When this
	 * happens we need to switch to the task-stack to run C code,
	 * but switch back to the entry-stack again when we approach
	 * iret and return to the interrupted code-path. This usually
	 * happens when we hit an exception while restoring user-space
	 * segment registers on the way back to user-space or when the
	 * sysenter handler runs with eflags.tf set.
	 *
	 * When we switch to the task-stack here, we can't trust the
	 * contents of the entry-stack anymore, as the exception handler
	 * might be scheduled out or moved to another CPU. Therefore we
	 * copy the complete entry-stack to the task-stack and set a
	 * marker in the iret-frame (bit 31 of the CS dword) to detect
	 * what we've done on the iret path.
	 *
	 * On the iret path we copy everything back and switch to the
	 * entry-stack, so that the interrupted kernel code-path
	 * continues on the same stack it was interrupted with.
	 *
	 * Be aware that an NMI can happen anytime in this code.
	 *
	 * %esi: Entry-Stack pointer (same as %esp)
	 * %edi: Top of the task stack
	 * %eax: CR3 on kernel entry
	 */

	/* Calculate number of bytes on the entry stack in %ecx */
	movl %esi, %ecx

	/* %ecx to the top of entry-stack */
	andl $(MASK_entry_stack), %ecx
	addl $(SIZEOF_entry_stack), %ecx

	/* Number of bytes on the entry stack to %ecx */
	sub %esi, %ecx

	/* Mark stackframe as coming from entry stack */
	orl $CS_FROM_ENTRY_STACK, PT_CS(%esp)

	/*
	 * Test the cr3 used to enter the kernel and add a marker
	 * so that we can switch back to it before iret.
	 */
	testl $PTI_SWITCH_MASK, %eax
	jz .Lcopy_pt_regs_\@
	orl $CS_FROM_USER_CR3, PT_CS(%esp)

	/*
	 * %esi and %edi are unchanged, %ecx contains the number of
	 * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
	 * the stack-frame on task-stack and copy everything over
	 */
	jmp .Lcopy_pt_regs_\@

.Lend_\@:
.endm
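
The "Are we on the entry stack?" test at the top of SWITCH_TO_KERNEL_STACK uses a single unsigned comparison to cover both bounds. A C sketch of that check follows; `on_entry_stack` is a hypothetical name, `stack_end` stands for cpu_entry_area + CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, and `stack_size` stands for SIZEOF_entry_stack.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Models subl/cmpl/jae: the macro bails out (not on the entry stack)
 * when (stack_end - esp) >= stack_size as an unsigned 32-bit value.
 * A stack pointer above stack_end wraps around to a huge difference,
 * so one comparison rejects addresses on either side of the stack.
 */
static bool on_entry_stack(uint32_t esp, uint32_t stack_end,
			   uint32_t stack_size)
{
	return (uint32_t)(stack_end - esp) < stack_size;
}
```

The addresses used to exercise this model are made up; only their position relative to `stack_end` matters.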

/*
 * Switch back from the kernel stack to the entry stack.
 *
 * The %esp register must point to pt_regs on the task stack. It will
 * first calculate the size of the stack-frame to copy, depending on
 * whether we return to VM86 mode or not. With that it uses 'rep movsl'
 * to copy the contents of the stack over to the entry stack.
 *
 * We must be very careful here, as we can't trust the contents of the
 * task-stack once we switched to the entry-stack. When an NMI happens
 * while on the entry-stack, the NMI handler will switch back to the top
 * of the task stack, overwriting our stack-frame we are about to copy.
 * Therefore we switch the stack only after everything is copied over.
 */
.macro SWITCH_TO_ENTRY_STACK

	/* Bytes to copy */
	movl $PTREGS_SIZE, %ecx

#ifdef CONFIG_VM86
	testl $(X86_EFLAGS_VM), PT_EFLAGS(%esp)
	jz .Lcopy_pt_regs_\@

	/* Additional 4 registers to copy when returning to VM86 mode */
	addl $(4 * 4), %ecx

.Lcopy_pt_regs_\@:
#endif

	/* Initialize source and destination for movsl */
	movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
	subl %ecx, %edi
	movl %esp, %esi

	/* Save future stack pointer in %ebx */
	movl %edi, %ebx

	/* Copy over the stack-frame */
	shrl $2, %ecx
	cld
	rep movsl

	/*
	 * Switch to entry-stack - needs to happen after everything is
	 * copied because the NMI handler will overwrite the task-stack
	 * when on entry-stack
	 */
	movl %ebx, %esp

.Lend_\@:
.endm

/*
 * This macro handles the case when we return to kernel-mode on the iret
 * path and have to switch back to the entry stack and/or user-cr3
 *
 * See the comments below the .Lentry_from_kernel_\@ label in the
 * SWITCH_TO_KERNEL_STACK macro for more details.
 */
.macro PARANOID_EXIT_TO_KERNEL_MODE

	/*
	 * Test if we entered the kernel with the entry-stack. Most
	 * likely we did not, because this code only runs on the
	 * return-to-kernel path.
	 */
	testl $CS_FROM_ENTRY_STACK, PT_CS(%esp)
	jz .Lend_\@
|
|
|
|
|
|
|
|
/* Unlikely slow-path */
|
|
|
|
|
|
|
|
/* Clear marker from stack-frame */
|
|
|
|
andl $(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
|
|
|
|
|
|
|
|
/* Copy the remaining task-stack contents to entry-stack */
|
|
|
|
movl %esp, %esi
|
|
|
|
movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
|
|
|
|
|
|
|
|
/* Bytes on the task-stack to ecx */
|
|
|
|
movl PER_CPU_VAR(cpu_tss_rw + TSS_sp1), %ecx
|
|
|
|
subl %esi, %ecx
|
|
|
|
|
|
|
|
/* Allocate stack-frame on entry-stack */
|
|
|
|
subl %ecx, %edi
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Save future stack-pointer, we must not switch until the
|
|
|
|
* copy is done, otherwise the NMI handler could destroy the
|
|
|
|
* contents of the task-stack we are about to copy.
|
|
|
|
*/
|
|
|
|
movl %edi, %ebx
|
|
|
|
|
|
|
|
/* Do the copy */
|
|
|
|
shrl $2, %ecx
|
|
|
|
cld
|
|
|
|
rep movsl
|
|
|
|
|
|
|
|
/* Safe to switch to entry-stack now */
|
|
|
|
movl %ebx, %esp
|
|
|
|
|
2018-07-18 17:40:49 +08:00
|
|
|
	/*
	 * We came from entry-stack and need to check if we also need to
	 * switch back to user cr3.
	 */
	testl	$CS_FROM_USER_CR3, PT_CS(%esp)
	jz	.Lend_\@

	/* Clear marker from stack-frame */
	andl	$(~CS_FROM_USER_CR3), PT_CS(%esp)

	SWITCH_TO_USER_CR3 scratch_reg=%eax

2018-07-18 17:40:47 +08:00
|
|
|
.Lend_\@:
.endm
2020-02-26 06:16:11 +08:00
|
|
|
|
|
|
|
/**
 * idtentry - Macro to generate entry stubs for simple IDT entries
 * @vector:		Vector number
 * @asmsym:		ASM symbol for the entry point
 * @cfunc:		C function to be called
 * @has_error_code:	Hardware pushed error code on stack
 */
2020-05-22 04:05:29 +08:00
|
|
|
.macro idtentry vector asmsym cfunc has_error_code:req
|
2020-02-26 06:16:11 +08:00
|
|
|
SYM_CODE_START(\asmsym)
	ASM_CLAC
	cld

	.if \has_error_code == 0
		pushl	$0		/* Clear the error code */
	.endif

	/* Push the C-function address into the GS slot */
	pushl	$\cfunc
	/* Invoke the common exception entry */
	jmp	handle_exception
SYM_CODE_END(\asmsym)
.endm

2020-05-22 04:05:36 +08:00
|
|
|
.macro idtentry_irq vector cfunc
	.p2align CONFIG_X86_L1_CACHE_SHIFT
SYM_CODE_START_LOCAL(asm_\cfunc)
	ASM_CLAC
	SAVE_ALL switch_stacks=1
	ENCODE_FRAME_POINTER
	movl	%esp, %eax
	movl	PT_ORIG_EAX(%esp), %edx		/* get the vector from stack */
	movl	$-1, PT_ORIG_EAX(%esp)		/* no syscall to restart */
	call	\cfunc
	jmp	handle_exception_return
SYM_CODE_END(asm_\cfunc)
.endm

2020-05-22 04:05:38 +08:00
|
|
|
.macro idtentry_sysvec vector cfunc
	idtentry \vector asm_\cfunc \cfunc has_error_code=0
.endm

2020-02-26 06:16:12 +08:00
|
|
|
/*
|
|
|
|
* Include the defines which emit the idt entries which are shared
|
2020-06-10 14:37:01 +08:00
|
|
|
 * between 32 and 64 bit and emit the __irqentry_text_* markers
 * so the stacktrace boundary checks work.
2020-02-26 06:16:12 +08:00
|
|
|
*/
|
2020-06-10 14:37:01 +08:00
|
|
|
	.align 16
	.globl __irqentry_text_start
__irqentry_text_start:

2020-02-26 06:16:12 +08:00
|
|
|
#include <asm/idtentry.h>
|
|
|
|
|
2020-06-10 14:37:01 +08:00
|
|
|
	.align 16
	.globl __irqentry_text_end
__irqentry_text_end:

2016-08-14 00:38:19 +08:00
|
|
|
/*
 * %eax: prev task
 * %edx: next task
 */
2020-03-26 02:47:40 +08:00
|
|
|
.pushsection .text, "ax"
|
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_START(__switch_to_asm)
|
2016-08-14 00:38:19 +08:00
|
|
|
	/*
	 * Save callee-saved registers
	 * This must match the order in struct inactive_task_frame
	 */
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
2019-11-16 18:12:03 +08:00
|
|
|
	/*
	 * Flags are saved to prevent AC leakage. This could go
	 * away if objtool would have 32bit support to verify
	 * the STAC/CLAC correctness.
	 */
2019-02-14 17:30:52 +08:00
|
|
|
pushfl
|
2016-08-14 00:38:19 +08:00
|
|
|
|
|
|
|
	/* switch stack */
	movl	%esp, TASK_threadsp(%eax)
	movl	TASK_threadsp(%edx), %esp

Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables
The changes to automatically test for working stack protector compiler
support in the Kconfig files removed the special STACKPROTECTOR_AUTO
option that picked the strongest stack protector that the compiler
supported.
That was all a nice cleanup - it makes no sense to have the AUTO case
now that the Kconfig phase can just determine the compiler support
directly.
HOWEVER.
It also meant that doing "make oldconfig" would now _disable_ the strong
stackprotector if you had AUTO enabled, because in a legacy config file,
the sane stack protector configuration would look like
CONFIG_HAVE_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_STACKPROTECTOR_AUTO=y
and when you ran this through "make oldconfig" with the Kbuild changes,
it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
used to be disabled (because it was really enabled by AUTO), and would
disable it in the new config, resulting in:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
That's dangerously subtle - people could suddenly find themselves with
the weaker stack protector setup without even realizing.
The solution here is to just rename not just the old REGULAR stack
protector option, but also the strong one. This does that by just
removing the CC_ prefix entirely for the user choices, because it really
is not about the compiler support (the compiler support now instead
automatically impacts _visibility_ of the options to users).
This results in "make oldconfig" actually asking the user for their
choice, so that we don't have any silent subtle security model changes.
The end result would generally look like this:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
where the "CC_" versions really are about internal compiler
infrastructure, not the user selections.
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-14 11:21:18 +08:00
|
|
|
#ifdef CONFIG_STACKPROTECTOR
|
2016-08-14 00:38:19 +08:00
|
|
|
movl TASK_stack_canary(%edx), %ebx
|
x86/stackprotector/32: Make the canary into a regular percpu variable
On 32-bit kernels, the stackprotector canary is quite nasty -- it is
stored at %gs:(20), which is nasty because 32-bit kernels use %fs for
percpu storage. It's even nastier because it means that whether %gs
contains userspace state or kernel state while running kernel code
depends on whether stackprotector is enabled (this is
CONFIG_X86_32_LAZY_GS), and this setting radically changes the way
that segment selectors work. Supporting both variants is a
maintenance and testing mess.
Merely rearranging so that percpu and the stack canary
share the same segment would be messy as the 32-bit percpu address
layout isn't currently compatible with putting a variable at a fixed
offset.
Fortunately, GCC 8.1 added options that allow the stack canary to be
accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary
percpu variable. This lets us get rid of all of the code to manage the
stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess.
(That name is special. We could use any symbol we want for the
%fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any
name other than __stack_chk_guard.)
Forcibly disable stackprotector on older compilers that don't support
the new options and turn the stack canary into a percpu variable. The
"lazy GS" approach is now used for all 32-bit configurations.
Also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels,
it loads the GS selector and updates the user GSBASE accordingly. (This
is unchanged.) On 32-bit kernels, it loads the GS selector and updates
GSBASE, which is now always the user base. This means that the overall
effect is the same on 32-bit and 64-bit, which avoids some ifdeffery.
[ bp: Massage commit message. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org
2021-02-14 03:19:44 +08:00
|
|
|
movl %ebx, PER_CPU_VAR(__stack_chk_guard)
|
2016-08-14 00:38:19 +08:00
|
|
|
#endif
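The canary hand-off in the CONFIG_STACKPROTECTOR block above (load the next task's canary, store it in the per-CPU guard) can be pictured with a short user-space C model. It is only a sketch: `model_task`, `model_guard`, and `model_switch_canary` are hypothetical names, and a plain array indexed by CPU number stands in for the kernel's per-CPU storage.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 2u	/* hypothetical CPU count for this model */

struct model_task {
	uint32_t stack_canary;	/* per-task canary, cf. TASK_stack_canary */
};

/* Stand-in for the per-CPU __stack_chk_guard variable: one slot per CPU. */
static uint32_t model_guard[MODEL_NR_CPUS];

/* On a context switch, publish the incoming task's canary on this CPU. */
static void model_switch_canary(unsigned int cpu, const struct model_task *next)
{
	assert(cpu < MODEL_NR_CPUS);
	model_guard[cpu] = next->stack_canary;
}
```

Because the guard is an ordinary per-CPU variable in this scheme, the switch is a single load and store with no segment-descriptor bookkeeping, which matches the two movl instructions in the listing.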
|
|
|
|
|
2018-01-13 01:49:25 +08:00
|
|
|
#ifdef CONFIG_RETPOLINE
	/*
	 * When switching from a shallower to a deeper call stack
	 * the RSB may either underflow or use entries populated
	 * with userspace addresses. On CPUs where those concerns
	 * exist, overwrite the RSB with entries which capture
	 * speculative execution to prevent attack.
	 */
2018-02-19 18:50:56 +08:00
|
|
|
FILL_RETURN_BUFFER %ebx, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_CTXSW
|
2018-01-13 01:49:25 +08:00
|
|
|
#endif
|
|
|
|
|
2019-11-16 18:12:03 +08:00
|
|
|
	/* Restore flags of the incoming task to restore AC state. */
|
2019-02-14 17:30:52 +08:00
|
|
|
popfl
|
2019-11-16 18:12:03 +08:00
|
|
|
/* restore callee-saved registers */
|
2016-08-14 00:38:19 +08:00
|
|
|
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp

	jmp	__switch_to
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_END(__switch_to_asm)
|
2020-03-26 02:47:40 +08:00
|
|
|
.popsection
|
2016-08-14 00:38:19 +08:00
|
|
|
|
2017-05-23 23:37:29 +08:00
|
|
|
/*
 * The unwinder expects the last frame on the stack to always be at the same
 * offset from the end of the page, which allows it to validate the stack.
 * Calling schedule_tail() directly would break that convention because it's an
 * asmlinkage function so its argument has to be pushed on the stack. This
 * wrapper creates a proper "end of stack" frame header before the call.
 */
2020-03-26 02:47:40 +08:00
|
|
|
.pushsection .text, "ax"
|
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_START(schedule_tail_wrapper)
|
2017-05-23 23:37:29 +08:00
|
|
|
	FRAME_BEGIN

	pushl	%eax
	call	schedule_tail
	popl	%eax

	FRAME_END
	ret
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_END(schedule_tail_wrapper)
|
2020-03-26 02:47:40 +08:00
|
|
|
.popsection
|
|
|
|
|
2016-08-14 00:38:19 +08:00
|
|
|
/*
 * A newly forked process directly context switches into this address.
 *
 * eax: prev task we switched from
2016-08-14 00:38:20 +08:00
|
|
|
* ebx: kernel thread func (NULL for user thread)
|
|
|
|
* edi: kernel thread arg
|
2016-08-14 00:38:19 +08:00
|
|
|
*/
|
2020-03-26 02:47:40 +08:00
|
|
|
.pushsection .text, "ax"
|
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_START(ret_from_fork)
|
2017-05-23 23:37:29 +08:00
|
|
|
call schedule_tail_wrapper
|
2015-10-06 08:48:13 +08:00
|
|
|
|
2016-08-14 00:38:20 +08:00
|
|
|
	testl	%ebx, %ebx
	jnz	1f		/* kernel threads are uncommon */

2:
2015-10-06 08:48:13 +08:00
|
|
|
/* When we fork, we trace the syscall return in the child, too. */
|
2017-05-23 23:37:29 +08:00
|
|
|
movl %esp, %eax
|
2020-07-23 06:00:05 +08:00
|
|
|
call syscall_exit_to_user_mode
|
2020-03-04 19:51:59 +08:00
|
|
|
jmp .Lsyscall_32_done
|
2015-10-06 08:48:13 +08:00
|
|
|
|
2016-08-14 00:38:20 +08:00
|
|
|
/* kernel thread */
|
|
|
|
1: movl %edi, %eax
|
2020-04-22 23:16:40 +08:00
|
|
|
CALL_NOSPEC ebx
|
2015-10-06 08:48:13 +08:00
|
|
|
/*
|
2016-08-14 00:38:20 +08:00
|
|
|
* A kernel thread is allowed to return here after successfully
|
2020-07-14 01:06:48 +08:00
|
|
|
* calling kernel_execve(). Exit to userspace to complete the execve()
|
2016-08-14 00:38:20 +08:00
|
|
|
* syscall.
|
2015-10-06 08:48:13 +08:00
|
|
|
*/
|
2016-08-14 00:38:20 +08:00
|
|
|
movl $0, PT_EAX(%esp)
|
|
|
|
jmp 2b
|
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_END(ret_from_fork)
|
2020-03-26 02:47:40 +08:00
|
|
|
.popsection
|
2012-08-03 03:05:11 +08:00
|
|
|
|
2019-10-11 19:50:59 +08:00
|
|
|
SYM_ENTRY(__begin_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
|
2016-03-10 11:00:30 +08:00
|
|
|
/*
 * All code from here through __end_SYSENTER_singlestep_region is subject
 * to being single-stepped if a user program sets TF and executes SYSENTER.
 * There is absolutely nothing that we can do to prevent this from happening
 * (thanks Intel!). To keep our handling of this situation as simple as
 * possible, we handle TF just like AC and NT, except that our #DB handler
 * will ignore all of the single-step traps generated in this range.
 */

2016-03-10 11:00:35 +08:00
|
|
|
/*
 * 32-bit SYSENTER entry.
 *
 * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
 * if X86_FEATURE_SEP is available. This is the preferred system call
 * entry on 32-bit systems.
 *
 * The SYSENTER instruction, in principle, should *only* occur in the
 * vDSO. In practice, a small number of Android devices were shipped
 * with a copy of Bionic that inlined a SYSENTER instruction. This
 * never happened in any of Google's Bionic versions -- it only happened
 * in a narrow range of Intel-provided versions.
 *
 * SYSENTER loads SS, ESP, CS, and EIP from previously programmed MSRs.
 * IF and VM in EFLAGS are cleared (IOW: interrupts are off).
 * SYSENTER does not save anything on the stack,
 * and does not save old EIP (!!!), ESP, or EFLAGS.
 *
 * To avoid losing track of EFLAGS.VM (and thus potentially corrupting
 * user and/or vm86 state), we explicitly disable the SYSENTER
 * instruction in vm86 mode by reprogramming the MSRs.
 *
 * Arguments:
 * eax  system call number
 * ebx  arg1
 * ecx  arg2
 * edx  arg3
 * esi  arg4
 * edi  arg5
 * ebp  user stack
 * 0(%ebp) arg6
 */
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_START(entry_SYSENTER_32)
|
2018-07-18 17:40:49 +08:00
|
|
|
	/*
	 * On entry-stack with all userspace-regs live - save and
	 * restore eflags and %eax to use it as scratch-reg for the cr3
	 * switch.
	 */
	pushfl
	pushl	%eax
2018-07-18 17:41:16 +08:00
|
|
|
BUG_IF_WRONG_CR3 no_user_check=1
|
2018-07-18 17:40:49 +08:00
|
|
|
	SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
	popl	%eax
	popfl

	/* Stack empty again, switch to task stack */
2018-07-18 17:40:39 +08:00
|
|
|
movl TSS_entry2task_stack(%esp), %esp
|
2018-07-18 17:40:49 +08:00
|
|
|
|
2016-09-22 05:03:59 +08:00
|
|
|
.Lsysenter_past_esp:
|
2015-10-06 08:48:15 +08:00
|
|
|
pushl $__USER_DS /* pt_regs->ss */
|
2020-06-27 01:21:12 +08:00
|
|
|
pushl $0 /* pt_regs->sp (placeholder) */
|
2015-10-06 08:48:15 +08:00
|
|
|
	pushfl				/* pt_regs->flags (except IF = 0) */
	pushl	$__USER_CS		/* pt_regs->cs */
	pushl	$0			/* pt_regs->ip = 0 (placeholder) */
	pushl	%eax			/* pt_regs->orig_ax */
2018-07-18 17:40:44 +08:00
|
|
|
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest, stack already switched */
|
2015-10-06 08:48:15 +08:00
|
|
|
|
2016-03-10 11:00:26 +08:00
|
|
|
/*
|
2016-03-10 11:00:30 +08:00
|
|
|
* SYSENTER doesn't filter flags, so we need to clear NT, AC
|
|
|
|
* and TF ourselves. To save a few cycles, we can check whether
|
2016-03-10 11:00:26 +08:00
|
|
|
	 * any of them was set instead of doing an unconditional popfl.
	 * This needs to happen before enabling interrupts so that
	 * we don't get preempted with NT set.
	 *
2016-03-10 11:00:30 +08:00
|
|
|
	 * If TF is set, we will single-step all the way to here -- do_debug
	 * will ignore all the traps. (Yes, this is slow, but so is
	 * single-stepping in general. This allows us to avoid having
	 * more complicated code to handle the case where a user program
	 * forces us to single-step through the SYSENTER entry code.)
	 *
2016-03-10 11:00:26 +08:00
|
|
|
	 * NB.: .Lsysenter_fix_flags is a label with the code under it moved
	 * out-of-line as an optimization: NT is unlikely to be set in the
	 * majority of the cases and instead of polluting the I$ unnecessarily,
	 * we're keeping that code behind a branch which will predict as
	 * not-taken and therefore its instructions won't be fetched.
	 */
2016-03-10 11:00:30 +08:00
|
|
|
testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp)
|
2016-03-10 11:00:26 +08:00
|
|
|
jnz .Lsysenter_fix_flags
|
|
|
|
.Lsysenter_flags_fixed:
|
|
|
|
|
2015-10-06 08:48:15 +08:00
|
|
|
movl %esp, %eax
|
2020-06-27 01:21:12 +08:00
|
|
|
call do_SYSENTER_32
|
2020-06-29 16:35:39 +08:00
|
|
|
testl %eax, %eax
|
|
|
|
jz .Lsyscall_32_done
|
2015-10-06 08:48:15 +08:00
|
|
|
|
2018-08-17 06:16:58 +08:00
|
|
|
STACKLEAK_ERASE
|
|
|
|
|
2020-03-04 19:51:59 +08:00
|
|
|
/* Opportunistic SYSEXIT */
|
2018-07-18 17:40:45 +08:00
|
|
|
|
|
|
|
	/*
	 * Setup entry stack - we keep the pointer in %eax and do the
	 * switch after almost all user-state is restored.
	 */

	/* Load entry stack pointer and allocate frame for eflags/eax */
	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %eax
	subl	$(2*4), %eax

	/* Copy eflags and eax to entry stack */
	movl	PT_EFLAGS(%esp), %edi
	movl	PT_EAX(%esp), %esi
	movl	%edi, (%eax)
	movl	%esi, 4(%eax)

	/* Restore user registers and segments */
2015-10-06 08:48:15 +08:00
|
|
|
movl PT_EIP(%esp), %edx /* pt_regs->ip */
|
|
|
|
movl PT_OLDESP(%esp), %ecx /* pt_regs->sp */
|
2015-10-17 06:42:55 +08:00
|
|
|
1: mov PT_FS(%esp), %fs
|
2018-07-18 17:40:45 +08:00
|
|
|
|
2015-10-06 08:48:15 +08:00
|
|
|
	popl	%ebx			/* pt_regs->bx */
	addl	$2*4, %esp		/* skip pt_regs->cx and pt_regs->dx */
	popl	%esi			/* pt_regs->si */
	popl	%edi			/* pt_regs->di */
	popl	%ebp			/* pt_regs->bp */
2018-07-18 17:40:45 +08:00
|
|
|
|
|
|
|
	/* Switch to entry stack */
	movl	%eax, %esp
2015-10-06 08:48:15 +08:00
|
|
|
|
2018-07-18 17:40:49 +08:00
|
|
|
	/* Now ready to switch the cr3 */
	SWITCH_TO_USER_CR3 scratch_reg=%eax

2016-03-10 11:00:27 +08:00
|
|
|
	/*
	 * Restore all flags except IF. (We restore IF separately because
	 * STI gives a one-instruction window in which we won't be interrupted,
	 * whereas POPF does not.)
	 */
2018-06-25 18:21:59 +08:00
|
|
|
btrl $X86_EFLAGS_IF_BIT, (%esp)
|
2018-07-18 17:41:16 +08:00
|
|
|
BUG_IF_WRONG_CR3 no_user_check=1
|
2016-03-10 11:00:27 +08:00
|
|
|
popfl
|
2018-07-18 17:40:45 +08:00
|
|
|
popl %eax
|
2016-03-10 11:00:27 +08:00
|
|
|
|
2015-10-06 08:48:15 +08:00
|
|
|
	/*
	 * Return back to the vDSO, which will pop ecx and edx.
	 * Don't bother with DS and ES (they already contain __USER_DS).
	 */
2015-11-20 05:55:46 +08:00
|
|
|
	sti
	sysexit
2008-06-24 19:16:52 +08:00
|
|
|
|
2015-06-08 15:49:11 +08:00
|
|
|
.pushsection .fixup, "ax"
2:	movl	$0, PT_FS(%esp)
	jmp	1b
[PATCH] i386: Use %gs as the PDA base-segment in the kernel
This patch is the meat of the PDA change. This patch makes several related
changes:
1: Most significantly, %gs is now used in the kernel. This means that on
entry, the old value of %gs is saved away, and it is reloaded with
__KERNEL_PDA.
2: entry.S constructs the stack in the shape of struct pt_regs, and this
is passed around the kernel so that the process's saved register
state can be accessed.
Unfortunately struct pt_regs doesn't currently have space for %gs
(or %fs). This patch extends pt_regs to add space for gs (no space
is allocated for %fs, since it won't be used, and it would just
complicate the code in entry.S to work around the space).
3: Because %gs is now saved on the stack like %ds, %es and the integer
registers, there are a number of places where it no longer needs to
be handled specially; namely context switch, and saving/restoring the
register state in a signal context.
4: And since kernel threads run in kernel space and call normal kernel
code, they need to be created with their %gs == __KERNEL_PDA.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07 09:14:02 +08:00
|
|
|
.popsection
|
2015-06-08 15:49:11 +08:00
|
|
|
_ASM_EXTABLE(1b, 2b)
|
2016-03-10 11:00:26 +08:00
|
|
|
|
|
|
|
.Lsysenter_fix_flags:
	pushl	$X86_EFLAGS_FIXED
	popfl
	jmp	.Lsysenter_flags_fixed
2019-10-11 19:50:59 +08:00
|
|
|
SYM_ENTRY(__end_SYSENTER_singlestep_region, SYM_L_GLOBAL, SYM_A_NONE)
|
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_END(entry_SYSENTER_32)
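The NT/AC/TF handling in entry_SYSENTER_32 (test the saved flags, and only on the rare hit reload EFLAGS with X86_EFLAGS_FIXED) can be modelled in a few lines of user-space C. This is a sketch, not the kernel's code: the `model_` functions are hypothetical, while the bit values are the architectural EFLAGS positions (TF bit 8, NT bit 14, AC bit 18, and the always-set bit 1).

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_FIXED 0x00000002u	/* bit 1, always reads as 1 */
#define X86_EFLAGS_TF    0x00000100u	/* trap flag, bit 8 */
#define X86_EFLAGS_NT    0x00004000u	/* nested task, bit 14 */
#define X86_EFLAGS_AC    0x00040000u	/* alignment check, bit 18 */

/* Mirrors the testl: does the saved user EFLAGS carry NT, AC or TF? */
static int model_needs_flag_fixup(uint32_t saved_eflags)
{
	return (saved_eflags &
		(X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_TF)) != 0;
}

/*
 * Mirrors the out-of-line fix: when a fixup is needed, the live flags are
 * replaced by X86_EFLAGS_FIXED (everything else cleared); otherwise the
 * live flags are left untouched. The saved copy in pt_regs is not modified.
 */
static uint32_t model_live_eflags_after_entry(uint32_t live_eflags,
					      uint32_t saved_eflags)
{
	assert((live_eflags & X86_EFLAGS_FIXED) != 0);
	if (model_needs_flag_fixup(saved_eflags))
		return X86_EFLAGS_FIXED;
	return live_eflags;
}
```

Keeping the check separate from the fix reflects the layout in the listing: the common path pays for one test and a not-taken branch, and the reload lives out of line.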
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-03-10 11:00:35 +08:00
|
|
|
/*
 * 32-bit legacy system call entry.
 *
 * 32-bit x86 Linux system calls traditionally used the INT $0x80
 * instruction. INT $0x80 lands here.
 *
 * This entry point can be used by any 32-bit program to perform system
 * calls. Instances of INT $0x80 can be found inline in various programs and
 * libraries. It is also used by the vDSO's __kernel_vsyscall
 * fallback for hardware that doesn't support a faster entry method.
 * Restarted 32-bit system calls also fall back to INT $0x80
 * regardless of what instruction was originally used to do the system
 * call. (64-bit programs can use INT $0x80 as well, but they can
 * only run on 64-bit kernels and therefore land in
 * entry_INT80_compat.)
 *
 * This is considered a slow path. It is not used by most libc
 * implementations on modern hardware except during process startup.
 *
 * Arguments:
 * eax  system call number
 * ebx  arg1
 * ecx  arg2
 * edx  arg3
 * esi  arg4
 * edi  arg5
 * ebp  arg6
 */
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_START(entry_INT80_32)
|
2012-09-22 04:58:10 +08:00
|
|
|
ASM_CLAC
|
2015-10-06 08:48:14 +08:00
|
|
|
pushl %eax /* pt_regs->orig_ax */
|
2018-07-18 17:40:44 +08:00
|
|
|
|
|
|
|
SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */
|
2015-10-06 08:48:14 +08:00
|
|
|
|
|
|
|
movl %esp, %eax
|
2016-03-10 05:24:32 +08:00
|
|
|
call do_int80_syscall_32
|
2015-10-06 08:48:15 +08:00
|
|
|
.Lsyscall_32_done:
|
2018-08-17 06:16:58 +08:00
|
|
|
STACKLEAK_ERASE
|
|
|
|
|
2020-03-04 19:51:59 +08:00
|
|
|
restore_all_switch_stack:
|
2018-07-18 17:40:45 +08:00
|
|
|
SWITCH_TO_ENTRY_STACK
|
2018-07-18 17:40:41 +08:00
|
|
|
CHECK_AND_APPLY_ESPFIX
|
2020-03-09 06:24:02 +08:00
|
|
|
|
2018-07-18 17:40:49 +08:00
|
|
|
/* Switch back to user CR3 */
|
|
|
|
SWITCH_TO_USER_CR3 scratch_reg=%eax
|
|
|
|
|
2018-07-18 17:41:16 +08:00
|
|
|
BUG_IF_WRONG_CR3
|
|
|
|
|
2018-07-18 17:40:49 +08:00
|
|
|
/* Restore user state */
|
|
|
|
RESTORE_REGS pop=4 # skip orig_eax/error_code
|
2016-09-22 05:03:59 +08:00
|
|
|
.Lirq_return:
|
membarrier/x86: Provide core serializing command
There are two places where core serialization is needed by membarrier:
1) When returning from the membarrier IPI,
2) After scheduler updates curr to a thread with a different mm, before
going back to user-space, since the curr->mm is used by membarrier to
check whether it needs to send an IPI to that CPU.
x86-32 uses IRET as return from interrupt, and both IRET and SYSEXIT to go
back to user-space. The IRET instruction is core serializing, but not
SYSEXIT.
x86-64 uses IRET as return from interrupt, which takes care of the IPI.
However, it can return to user-space through either SYSRETL (compat
code), SYSRETQ, or IRET. Given that SYSRET{L,Q} is not core serializing,
we rely instead on write_cr3() performed by switch_mm() to provide core
serialization after changing the current mm, and deal with the special
case of kthread -> uthread (temporarily keeping current mm into
active_mm) by adding a sync_core() in that specific case.
Use the new sync_core_before_usermode() to guarantee this.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: David Sehr <sehr@google.com>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20180129202020.8515-10-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-30 04:20:18 +08:00
|
|
|
	/*
	 * ARCH_HAS_MEMBARRIER_SYNC_CORE relies on IRET core serialization
	 * when returning from IPI handler and when returning from
	 * scheduler to user-space.
	 */
2021-03-11 22:23:14 +08:00
|
|
|
iret
|
2016-09-22 05:03:59 +08:00
|
|
|
|
2015-06-08 15:49:11 +08:00
|
|
|
.section .fixup, "ax"
|
2020-02-26 06:16:30 +08:00
|
|
|
SYM_CODE_START(asm_iret_error)
|
2015-06-08 15:49:11 +08:00
|
|
|
pushl $0 # no error code
|
2020-02-26 06:16:30 +08:00
|
|
|
pushl $iret_error
|
2018-07-18 17:41:16 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_DEBUG_ENTRY
	/*
	 * The stack-frame here is the one that iret faulted on, so it's a
	 * return-to-user frame. We are on kernel-cr3 because we come here from
	 * the fixup code. This confuses the CR3 checker, so switch to user-cr3
	 * as the checker expects it.
	 */
	pushl	%eax
	SWITCH_TO_USER_CR3 scratch_reg=%eax
	popl	%eax
#endif

2020-02-26 06:16:30 +08:00
|
|
|
jmp handle_exception
|
|
|
|
SYM_CODE_END(asm_iret_error)
|
2005-04-17 06:20:36 +08:00
|
|
|
.previous
|
2020-02-26 06:16:30 +08:00
|
|
|
_ASM_EXTABLE(.Lirq_return, asm_iret_error)
|
2019-10-11 19:51:07 +08:00
|
|
|
SYM_FUNC_END(entry_INT80_32)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2009-02-09 21:17:40 +08:00
|
|
|
.macro FIXUP_ESPFIX_STACK
|
i386: fix/simplify espfix stack switching, move it into assembly
The espfix code triggers if we have a protected mode userspace
application with a 16-bit stack. On returning to userspace, with iret,
the CPU doesn't restore the high word of the stack pointer. This is an
"official" bug, and the work-around used in the kernel is to temporarily
switch to a 32-bit stack segment/pointer pair where the high word of the
pointer is equal to the high word of the userspace stackpointer.
The current implementation uses THREAD_SIZE to determine the cut-off,
but there is no good reason not to use the more natural 64kb... However,
implementing this by simply substituting THREAD_SIZE with 65536 in
patch_espfix_desc crashed the test application. patch_espfix_desc tries
to do what is described above, but gets it subtly wrong if the userspace
stack pointer is just below a multiple of THREAD_SIZE: an overflow
occurs to bit 13... With a bit of luck, when the kernelspace
stackpointer is just below a 64kb-boundary, the overflow then ripples
through to bit 16 and userspace will see its stack pointer changed by
65536.
This patch moves all espfix code into entry_32.S. Selecting a 16-bit
cut-off simplifies the code. The game with changing the limit dynamically
is removed too. It complicates matters and I see no value in it. Changing
only the top 16-bit word of ESP is one instruction and it also implies
that only two bytes of the ESPFIX GDT entry need to be changed and this
can be implemented in just a handful of simple-to-understand instructions.
As a side effect, the operation to compute the original ESP from the
ESPFIX ESP and the GDT entry simplifies a bit too, and the remaining
three instructions have been expanded inline in entry_32.S.
impact: can now reliably run userspace with ESP=xxxxfffc on 16-bit
stack segment
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-18 06:35:58 +08:00
|
|
|
	/*
	 * Switch back from the ESPFIX stack to the normal zero-based stack
	 *
	 * We can't call C functions using the ESPFIX stack. This code reads
	 * the high word of the segment base from the GDT and switches to the
	 * normal stack and adjusts ESP with the matching offset.
2019-11-25 00:50:03 +08:00
|
|
|
	 *
	 * We might be on user CR3 here, so percpu data is not mapped and we can't
	 * access the GDT through the percpu segment. Instead, use SGDT to find
	 * the cpu_entry_area alias of the GDT.
2009-06-18 06:35:58 +08:00
|
|
|
*/
|
2014-05-05 01:36:22 +08:00
|
|
|
#ifdef CONFIG_X86_ESPFIX32
|
2009-06-18 06:35:58 +08:00
|
|
|
/* fixup the stack */
|
2019-11-25 00:50:03 +08:00
|
|
|
pushl %ecx
|
|
|
|
subl $2*4, %esp
|
|
|
|
sgdt (%esp)
|
|
|
|
movl 2(%esp), %ecx /* GDT address */
|
|
|
|
/*
|
|
|
|
* Careful: ECX is a linear pointer, so we need to force base
|
|
|
|
* zero. %cs is the only known-linear segment we have right now.
|
|
|
|
*/
|
|
|
|
mov %cs:GDT_ESPFIX_OFFSET + 4(%ecx), %al /* bits 16..23 */
|
|
|
|
mov %cs:GDT_ESPFIX_OFFSET + 7(%ecx), %ah /* bits 24..31 */
|
2015-06-09 04:35:33 +08:00
|
|
|
shl $16, %eax
|
2019-11-25 00:50:03 +08:00
|
|
|
addl $2*4, %esp
|
|
|
|
popl %ecx
|
2015-06-08 15:49:11 +08:00
|
|
|
addl %esp, %eax /* the adjusted stack pointer */
|
|
|
|
pushl $__KERNEL_DS
|
|
|
|
pushl %eax
|
|
|
|
lss (%esp), %esp /* switch to the normal stack segment */
|
2014-05-05 01:36:22 +08:00
|
|
|
#endif
|
2009-02-09 21:17:40 +08:00
|
|
|
.endm
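The address arithmetic in FIXUP_ESPFIX_STACK can be modelled in plain C. This is a minimal sketch rather than kernel code: `espfix_fixup_esp` and its parameters are hypothetical names, and it assumes the standard x86 segment-descriptor layout, in which base bits 16..23 live in byte 4 and base bits 24..31 in byte 7 of the 8-byte GDT entry.

```c
#include <stdint.h>

/*
 * Model of FIXUP_ESPFIX_STACK (hypothetical helper, illustration only).
 * desc is the 8-byte ESPFIX GDT entry; esp is the stack pointer while
 * running on the espfix stack segment.
 */
uint32_t espfix_fixup_esp(const uint8_t desc[8], uint32_t esp)
{
    /* mov ...+4(%ecx), %al ; mov ...+7(%ecx), %ah */
    uint32_t base_hi = ((uint32_t)desc[7] << 8) | desc[4];

    /* shl $16, %eax ; addl %esp, %eax  (wraps modulo 2^32) */
    return (base_hi << 16) + esp;
}
```

Like the macro, the sketch ignores base bits 0..15 of the descriptor, which matches the commit message's point that only two bytes of the ESPFIX GDT entry ever change.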
|
2019-11-20 17:10:49 +08:00
|
|
|
|
2009-02-09 21:17:40 +08:00
|
|
|
.macro UNWIND_ESPFIX_STACK
|
2019-11-20 17:10:49 +08:00
|
|
|
/* It's safe to clobber %eax, all other regs need to be preserved */
|
2014-05-05 01:36:22 +08:00
|
|
|
#ifdef CONFIG_X86_ESPFIX32
|
2015-06-08 15:49:11 +08:00
|
|
|
movl %ss, %eax
|
2009-02-09 21:17:40 +08:00
|
|
|
/* see if on espfix stack */
|
2015-06-08 15:49:11 +08:00
|
|
|
cmpw $__ESPFIX_SS, %ax
|
2019-11-20 17:10:49 +08:00
|
|
|
jne .Lno_fixup_\@
|
2009-02-09 21:17:40 +08:00
|
|
|
/* switch to normal stack */
|
|
|
|
FIXUP_ESPFIX_STACK
|
2019-11-20 17:10:49 +08:00
|
|
|
.Lno_fixup_\@:
|
2014-05-05 01:36:22 +08:00
|
|
|
#endif
|
2009-02-09 21:17:40 +08:00
|
|
|
.endm
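The reason these two macros exist is the IRET behaviour described in the espfix commit message: on return to a 16-bit stack segment only the low word of ESP is restored. A small C model may make that concrete. Both function names below are hypothetical and the code only mirrors the bit manipulation, not any kernel interface.

```c
#include <stdint.h>

/*
 * IRET to a 16-bit stack segment restores only the low word of ESP;
 * the high word keeps whatever value the kernel had there.
 */
uint32_t iret_to_16bit_ss(uint32_t kernel_esp, uint32_t user_esp)
{
    return (kernel_esp & 0xffff0000u) | (user_esp & 0x0000ffffu);
}

/*
 * The work-around: before IRET, switch to a stack pointer whose high
 * word already equals the high word of the user stack pointer.
 */
uint32_t espfix_esp(uint32_t kernel_esp, uint32_t user_esp)
{
    return (user_esp & 0xffff0000u) | (kernel_esp & 0x0000ffffu);
}
```

With the work-around applied, `iret_to_16bit_ss(espfix_esp(k, u), u)` yields `u` for any `k`, so userspace sees its own stack pointer unchanged. Without it, the high word of the kernel stack pointer leaks into the user-visible ESP.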
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-02-26 06:16:11 +08:00
|
|
|
SYM_CODE_START_LOCAL_NOALIGN(handle_exception)
|
|
|
|
/* the function address is in %gs's slot on the stack */
|
|
|
|
SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1
|
|
|
|
ENCODE_FRAME_POINTER
|
|
|
|
|
|
|
|
movl PT_GS(%esp), %edi # get the function address
|
|
|
|
|
|
|
|
/* fixup orig %eax */
|
|
|
|
movl PT_ORIG_EAX(%esp), %edx # get the error code
|
|
|
|
movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
|
|
|
|
|
|
|
|
movl %esp, %eax # pt_regs pointer
|
|
|
|
CALL_NOSPEC edi
|
|
|
|
|
2020-05-22 04:05:26 +08:00
|
|
|
handle_exception_return:
|
2020-02-26 06:16:11 +08:00
|
|
|
#ifdef CONFIG_VM86
|
|
|
|
movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
|
|
|
|
movb PT_CS(%esp), %al
|
|
|
|
andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
|
|
|
|
#else
|
|
|
|
/*
|
|
|
|
* We can be coming here from a child spawned by kernel_thread().
|
|
|
|
*/
|
|
|
|
movl PT_CS(%esp), %eax
|
|
|
|
andl $SEGMENT_RPL_MASK, %eax
|
|
|
|
#endif
|
|
|
|
cmpl $USER_RPL, %eax # returning to v8086 or userspace ?
|
|
|
|
jnb ret_to_user
|
|
|
|
|
|
|
|
PARANOID_EXIT_TO_KERNEL_MODE
|
|
|
|
BUG_IF_WRONG_CR3
|
|
|
|
RESTORE_REGS 4
|
|
|
|
jmp .Lirq_return
|
|
|
|
|
|
|
|
ret_to_user:
|
|
|
|
movl %esp, %eax
|
|
|
|
jmp restore_all_switch_stack
|
|
|
|
SYM_CODE_END(handle_exception)
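The CONFIG_VM86 branch at handle_exception_return folds two checks into one unsigned compare. A hedged C model of that test follows; the function name is hypothetical, and the three constants are written out with the values they have on x86 (EFLAGS.VM is bit 17, the RPL occupies the low two selector bits, user RPL is 3).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_VM    0x00020000u /* virtual-8086 mode flag */
#define SEGMENT_RPL_MASK 0x3u        /* requested privilege level bits */
#define USER_RPL         0x3u

/* Model of the "returning to v8086 or userspace?" test. */
bool returns_to_user_or_v8086(uint32_t eflags, uint32_t cs)
{
    /* movl PT_EFLAGS(%esp), %eax ; movb PT_CS(%esp), %al */
    uint32_t mixed = (eflags & ~0xffu) | (cs & 0xffu);

    /* andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax */
    mixed &= X86_EFLAGS_VM | SEGMENT_RPL_MASK;

    /* cmpl $USER_RPL, %eax ; jnb ret_to_user (unsigned >=) */
    return mixed >= USER_RPL;
}
```

Because the VM bit is far larger than USER_RPL, a set VM flag passes the compare regardless of the selector's RPL, while with VM clear only an RPL of 3 does.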
|
|
|
|
|
2020-02-26 06:33:31 +08:00
|
|
|
SYM_CODE_START(asm_exc_double_fault)
|
2019-11-21 15:06:41 +08:00
|
|
|
1:
|
|
|
|
/*
|
|
|
|
* This is a task gate handler, not an interrupt gate handler.
|
|
|
|
* The error code is on the stack, but the stack is otherwise
|
|
|
|
* empty. Interrupts are off. Our state is sane with the following
|
|
|
|
* exceptions:
|
|
|
|
*
|
|
|
|
* - CR0.TS is set. "TS" literally means "task switched".
|
|
|
|
* - EFLAGS.NT is set because we're a "nested task".
|
|
|
|
* - The doublefault TSS has back_link set and has been marked busy.
|
|
|
|
* - TR points to the doublefault TSS and the normal TSS is busy.
|
|
|
|
* - CR3 is the normal kernel PGD. This would be delightful, except
|
|
|
|
* that the CPU didn't bother to save the old CR3 anywhere. This
|
|
|
|
* would make it very awkward to return back to the context we came
|
|
|
|
* from.
|
|
|
|
*
|
|
|
|
* The rest of EFLAGS is sanitized for us, so we don't need to
|
|
|
|
* worry about AC or DF.
|
|
|
|
*
|
|
|
|
* Don't even bother popping the error code. It's always zero,
|
|
|
|
* and ignoring it makes us a bit more robust against buggy
|
|
|
|
* hypervisor task gate implementations.
|
|
|
|
*
|
|
|
|
* We will manually undo the task switch instead of doing a
|
|
|
|
* task-switching IRET.
|
|
|
|
*/
|
|
|
|
|
|
|
|
clts /* clear CR0.TS */
|
|
|
|
pushl $X86_EFLAGS_FIXED
|
|
|
|
popfl /* clear EFLAGS.NT */
|
|
|
|
|
|
|
|
call doublefault_shim
|
|
|
|
|
|
|
|
/* We don't support returning, so we have no IRET here. */
|
|
|
|
1:
|
|
|
|
hlt
|
|
|
|
jmp 1b
|
2020-02-26 06:33:31 +08:00
|
|
|
SYM_CODE_END(asm_exc_double_fault)
|
2019-11-21 15:06:41 +08:00
|
|
|
|
2008-11-24 22:38:45 +08:00
|
|
|
/*
|
x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST,
so the exception handler is invoked on the temporary SYSENTER stack.
Because the SYSENTER stack is very small, we have a fixup to switch
off the stack quickly when this happens. The old fixup had several issues:
1. It checked the interrupt frame's CS and EIP. This wasn't
obviously correct on Xen or if vm86 mode was in use [1].
2. In the NMI handler, it did some frightening digging into the
stack frame. I'm not convinced this digging was correct.
3. The fixup didn't switch stacks and then switch back. Instead, it
synthesized a brand new stack frame that would redirect the IRET
back to the SYSENTER code. That frame was highly questionable.
For one thing, if NMI nested inside #DB, we would effectively
abort the #DB prologue, which was probably safe but was
frightening. For another, the code used PUSHFL to write the
FLAGS portion of the frame, which was simply bogus -- by the time
PUSHFL was called, at least TF, NT, VM, and all of the arithmetic
flags were clobbered.
Simplify this considerably. Instead of looking at the saved frame
to see where we came from, check the hardware ESP register against
the SYSENTER stack directly. Malicious user code cannot spoof the
kernel ESP register, and by moving the check after SAVE_ALL, we can
use normal PER_CPU accesses to find all the relevant addresses.
With this patch applied, the improved syscall_nt_32 test finally
passes on 32-bit kernels.
[1] It isn't obviously correct, but it is nonetheless safe from vm86
shenanigans as far as I can tell. A user can't point EIP at
entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32,
like all kernel addresses, is greater than 0xffff and would thus
violate the CS segment limit.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b2cdbc037031c07ecf2c40a96069318aec0e7971.1457578375.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-10 11:00:32 +08:00
|
|
|
* NMI is doubly nasty. It can happen on the first instruction of
|
|
|
|
* entry_SYSENTER_32 (just like #DB), but it can also interrupt the beginning
|
|
|
|
* of the #DB handler even if that #DB in turn hit before entry_SYSENTER_32
|
|
|
|
* switched stacks. We handle both conditions by simply checking whether we
|
|
|
|
* interrupted kernel code running on the SYSENTER stack.
|
2008-11-24 22:38:45 +08:00
|
|
|
*/
|
2020-02-26 06:33:25 +08:00
|
|
|
SYM_CODE_START(asm_exc_nmi)
|
2012-09-22 04:58:10 +08:00
|
|
|
ASM_CLAC
|
2018-07-18 17:40:44 +08:00
|
|
|
|
2014-05-05 01:36:22 +08:00
|
|
|
#ifdef CONFIG_X86_ESPFIX32
|
2019-11-20 22:02:26 +08:00
|
|
|
/*
|
|
|
|
* ESPFIX_SS is only ever set on the return to user path
|
|
|
|
* after we've switched to the entry stack.
|
|
|
|
*/
|
2015-06-08 15:49:11 +08:00
|
|
|
pushl %eax
|
|
|
|
movl %ss, %eax
|
|
|
|
cmpw $__ESPFIX_SS, %ax
|
|
|
|
popl %eax
|
2016-09-22 05:03:59 +08:00
|
|
|
je .Lnmi_espfix_stack
|
2014-05-05 01:36:22 +08:00
|
|
|
#endif
|
2016-03-10 11:00:32 +08:00
|
|
|
|
|
|
|
pushl %eax # pt_regs->orig_ax
|
2018-07-18 17:40:50 +08:00
|
|
|
SAVE_ALL_NMI cr3_reg=%edi
|
2016-10-21 00:34:40 +08:00
|
|
|
ENCODE_FRAME_POINTER
|
2015-06-08 15:49:11 +08:00
|
|
|
xorl %edx, %edx # zero error code
|
|
|
|
movl %esp, %eax # pt_regs pointer
|
2016-03-10 11:00:32 +08:00
|
|
|
|
|
|
|
/* Are we currently on the SYSENTER stack? */
|
2017-12-04 22:07:20 +08:00
|
|
|
movl PER_CPU_VAR(cpu_entry_area), %ecx
|
2017-12-05 09:25:07 +08:00
|
|
|
addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
|
|
|
|
subl %eax, %ecx /* ecx = (end of entry_stack) - esp */
|
|
|
|
cmpl $SIZEOF_entry_stack, %ecx
|
2016-03-10 11:00:32 +08:00
|
|
|
jb .Lnmi_from_sysenter_stack
|
|
|
|
|
|
|
|
/* Not on SYSENTER stack. */
|
2020-02-26 06:33:25 +08:00
|
|
|
call exc_nmi
|
2018-07-18 17:40:42 +08:00
|
|
|
jmp .Lnmi_return
|
2008-11-24 22:38:45 +08:00
|
|
|
|
2016-03-10 11:00:32 +08:00
|
|
|
.Lnmi_from_sysenter_stack:
|
|
|
|
/*
|
|
|
|
* We're on the SYSENTER stack. Switch off. No one (not even debug)
|
|
|
|
* is using the thread stack right now, so it's safe for us to use it.
|
|
|
|
*/
|
2016-10-21 00:34:40 +08:00
|
|
|
movl %esp, %ebx
|
2016-03-10 11:00:32 +08:00
|
|
|
movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
|
2020-02-26 06:33:25 +08:00
|
|
|
call exc_nmi
|
2016-10-21 00:34:40 +08:00
|
|
|
movl %ebx, %esp
|
2018-07-18 17:40:42 +08:00
|
|
|
|
|
|
|
.Lnmi_return:
|
2019-11-20 22:02:26 +08:00
|
|
|
#ifdef CONFIG_X86_ESPFIX32
|
|
|
|
testl $CS_FROM_ESPFIX, PT_CS(%esp)
|
|
|
|
jnz .Lnmi_from_espfix
|
|
|
|
#endif
|
|
|
|
|
2018-07-18 17:40:42 +08:00
|
|
|
CHECK_AND_APPLY_ESPFIX
|
2018-07-18 17:40:50 +08:00
|
|
|
RESTORE_ALL_NMI cr3_reg=%edi pop=4
|
2018-07-18 17:40:42 +08:00
|
|
|
jmp .Lirq_return
|
2008-11-24 22:38:45 +08:00
|
|
|
|
2014-05-05 01:36:22 +08:00
|
|
|
#ifdef CONFIG_X86_ESPFIX32
|
2016-09-22 05:03:59 +08:00
|
|
|
.Lnmi_espfix_stack:
|
x86/debug: Remove perpetually broken, unmaintainable dwarf annotations
So the dwarf2 annotations in low level assembly code have
become an increasing hindrance: unreadable, messy macros
mixed into some of the most security sensitive code paths
of the Linux kernel.
These debug info annotations don't even buy the upstream
kernel anything: dwarf driven stack unwinding has caused
problems in the past so it's out of tree, and the upstream
kernel only uses the much more robust framepointers based
stack unwinding method.
In addition to that there's a steady, slow bitrot going
on with these annotations, requiring frequent fixups.
There's no tooling and no functionality upstream that
keeps it correct.
So burn down the sick forest, allowing new, healthier growth:
27 files changed, 350 insertions(+), 1101 deletions(-)
Someone who has the willingness and time to do this
properly can attempt to reintroduce dwarf debuginfo in x86
assembly code plus dwarf unwinding from first principles,
with the following conditions:
- it should be maximally readable, and maximally low-key to
'ordinary' code reading and maintenance.
- find a build time method to insert dwarf annotations
automatically in the most common cases, for pop/push
instructions that manipulate the stack pointer. This could
be done for example via a preprocessing step that just
looks for common patterns - plus special annotations for
the few cases where we want to depart from the default.
We have hundreds of CFI annotations, so automating most of
that makes sense.
- it should come with build tooling checks that ensure that
CFI annotations are sensible. We've seen such efforts from
the framepointer side, and there's no reason it couldn't be
done on the dwarf side.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-28 18:21:47 +08:00
|
|
|
/*
|
2019-11-20 22:02:26 +08:00
|
|
|
* Create the pointer to LSS back
|
2008-11-24 22:38:45 +08:00
|
|
|
*/
|
2015-06-08 15:49:11 +08:00
|
|
|
pushl %ss
|
|
|
|
pushl %esp
|
|
|
|
addl $4, (%esp)
|
2019-11-20 22:02:26 +08:00
|
|
|
|
|
|
|
/* Copy the (short) IRET frame */
|
|
|
|
pushl 4*4(%esp) # flags
|
|
|
|
pushl 4*4(%esp) # cs
|
|
|
|
pushl 4*4(%esp) # ip
|
|
|
|
|
|
|
|
pushl %eax # orig_ax
|
|
|
|
|
|
|
|
SAVE_ALL_NMI cr3_reg=%edi unwind_espfix=1
|
2016-10-21 00:34:40 +08:00
|
|
|
ENCODE_FRAME_POINTER
|
2019-11-20 22:02:26 +08:00
|
|
|
|
|
|
|
/* clear CS_FROM_KERNEL, set CS_FROM_ESPFIX */
|
|
|
|
xorl $(CS_FROM_ESPFIX | CS_FROM_KERNEL), PT_CS(%esp)
|
|
|
|
|
2015-06-08 15:49:11 +08:00
|
|
|
xorl %edx, %edx # zero error code
|
2019-11-20 22:02:26 +08:00
|
|
|
movl %esp, %eax # pt_regs pointer
|
|
|
|
jmp .Lnmi_from_sysenter_stack
|
|
|
|
|
|
|
|
.Lnmi_from_espfix:
|
2018-07-18 17:40:50 +08:00
|
|
|
RESTORE_ALL_NMI cr3_reg=%edi
|
2019-11-20 22:02:26 +08:00
|
|
|
/*
|
|
|
|
* Because we cleared CS_FROM_KERNEL, IRET_FRAME 'forgot' to
|
|
|
|
* fix up the gap and long frame:
|
|
|
|
*
|
|
|
|
* 3 - original frame (exception)
|
|
|
|
* 2 - ESPFIX block (above)
|
|
|
|
* 6 - gap (FIXUP_FRAME)
|
|
|
|
* 5 - long frame (FIXUP_FRAME)
|
|
|
|
* 1 - orig_ax
|
|
|
|
*/
|
|
|
|
lss (1+5+6)*4(%esp), %esp # back to espfix stack
|
2016-09-22 05:03:59 +08:00
|
|
|
jmp .Lirq_return
|
2014-05-05 01:36:22 +08:00
|
|
|
#endif
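The stack bookkeeping in .Lnmi_espfix_stack and .Lnmi_from_espfix can be followed with a toy model. In this sketch, which is an illustration under stated assumptions and not kernel code, memory is an array of 32-bit words, `esp` is a word index that grows downward, and byte addresses are simply `index * 4`. Both function names are hypothetical; the slot counts are taken from the comment above the final `lss`.

```c
#include <stdint.h>

/*
 * Model of: pushl %ss ; pushl %esp ; addl $4, (%esp)
 * Returns the new stack index. Afterwards stack[esp] holds the byte
 * address ESP had before both pushes and stack[esp + 1] holds %ss,
 * which is the offset:selector pair that lss expects.
 */
unsigned build_lss_pointer(uint32_t *stack, unsigned esp, uint32_t ss)
{
    uint32_t esp_before_push;

    stack[--esp] = ss;              /* pushl %ss */
    esp_before_push = esp * 4;      /* byte address ESP holds right now */
    stack[--esp] = esp_before_push; /* pushl %esp stores the pre-push value */
    stack[esp] += 4;                /* addl $4, (%esp) */
    return esp;
}

/* Slots between ESP at the final lss and the ESPFIX block. */
enum { ORIG_AX_SLOTS = 1, LONG_FRAME_SLOTS = 5, GAP_SLOTS = 6 };

/* Byte offset used by: lss (1+5+6)*4(%esp), %esp */
uint32_t espfix_block_offset(void)
{
    return (ORIG_AX_SLOTS + LONG_FRAME_SLOTS + GAP_SLOTS) * 4;
}
```

The model relies on the documented x86 behaviour that `push %esp` stores the value ESP had before the instruction executed, which is why the single `addl $4` is enough to recover the original stack pointer.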
|
2020-02-26 06:33:25 +08:00
|
|
|
SYM_CODE_END(asm_exc_nmi)
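The "Are we currently on the SYSENTER stack?" test in asm_exc_nmi uses a common unsigned-subtraction trick to check both bounds with one compare. A minimal C sketch of the same comparison follows; `on_entry_stack` and its parameters are hypothetical names standing in for the per-CPU entry stack address and SIZEOF_entry_stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the entry-stack range check in asm_exc_nmi. */
bool on_entry_stack(uint32_t esp, uint32_t stack_base, uint32_t stack_size)
{
    /* addl $CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx */
    uint32_t end = stack_base + stack_size;

    /* subl %eax, %ecx ; cmpl $SIZEOF_entry_stack, %ecx ; jb */
    return (uint32_t)(end - esp) < stack_size;
}
```

If `esp` lies above the end of the stack, the subtraction wraps to a very large unsigned value, so the single `jb` rejects addresses on either side of the range.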
|
2008-11-24 22:38:45 +08:00
|
|
|
|
2020-03-26 02:47:40 +08:00
|
|
|
.pushsection .text, "ax"
|
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_START(rewind_stack_do_exit)
|
2016-07-15 04:22:55 +08:00
|
|
|
/* Prevent any naive code from trying to unwind to our caller. */
|
|
|
|
xorl %ebp, %ebp
|
|
|
|
|
|
|
|
movl PER_CPU_VAR(cpu_current_top_of_stack), %esi
|
|
|
|
leal -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
|
|
|
|
|
|
|
|
call do_exit
|
|
|
|
1: jmp 1b
|
2019-10-11 19:51:06 +08:00
|
|
|
SYM_CODE_END(rewind_stack_do_exit)
|
2020-03-26 02:47:40 +08:00
|
|
|
.popsection
|