Quoting Linus:
I do think that it would be a good idea to very expressly document
the fact that it's not that the user access itself is unsafe. I do
agree that things like "get_user()" want to be protected, but not
because of any direct bugs or problems with get_user() and friends,
but simply because get_user() is an excellent source of a pointer
that is obviously controlled from a potentially attacking user
space. So it's a prime candidate for then finding _subsequent_
accesses that can then be used to perturb the cache.
Unlike the __get_user() case get_user() includes the address limit check
near the pointer de-reference. With that locality the speculation can be
mitigated with pointer narrowing rather than a barrier, i.e.
array_index_nospec(). Where the narrowing is performed by:
cmp %limit, %ptr
sbb %mask, %mask
and %mask, %ptr
With respect to speculation the value of %ptr is either less than %limit
or NULL.
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: kernel-hardening@lists.openwall.com
Cc: gregkh@linuxfoundation.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: alan@linux.intel.com
Link: https://lkml.kernel.org/r/151727417469.33451.11804043010080838495.stgit@dwillia2-desk3.amr.corp.intel.com
Quoting Linus:
I do think that it would be a good idea to very expressly document
the fact that it's not that the user access itself is unsafe. I do
agree that things like "get_user()" want to be protected, but not
because of any direct bugs or problems with get_user() and friends,
but simply because get_user() is an excellent source of a pointer
that is obviously controlled from a potentially attacking user
space. So it's a prime candidate for then finding _subsequent_
accesses that can then be used to perturb the cache.
__uaccess_begin_nospec() covers __get_user() and copy_from_iter() where the
limit check is far away from the user pointer de-reference. In those cases
a barrier_nospec() prevents speculation with a potential pointer to
privileged memory. uaccess_try_nospec covers get_user_try.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: kernel-hardening@lists.openwall.com
Cc: gregkh@linuxfoundation.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: alan@linux.intel.com
Link: https://lkml.kernel.org/r/151727416953.33451.10508284228526170604.stgit@dwillia2-desk3.amr.corp.intel.com
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJabj6pAAoJEHm+PkMAQRiGs8cIAJQFkCWnbz86e3vG4DuWhyA8
CMGHCQdUOxxFGa/ixhIiuetbC0x+JVHAjV2FwVYbAQfaZB3pfw2iR1ncQxpAP1AI
oLU9vBEqTmwKMPc9CM5rRfnLFWpGcGwUNzgPdxD5yYqGDtcM8K840mF6NdkYe5AN
xU8rv1wlcFPF4A5pvHCH0pvVmK4VxlVFk/2H67TFdxBs4PyJOnSBnf+bcGWgsKO6
hC8XIVtcKCH2GfFxt5d0Vgc5QXJEpX1zn2mtCa1MwYRjN2plgYfD84ha0xE7J0B0
oqV/wnjKXDsmrgVpncr3txd4+zKJFNkdNRE4eLAIupHo2XHTG4HvDJ5dBY2NhGU=
=sOml
-----END PGP SIGNATURE-----
Merge tag 'v4.15' into x86/pti, to be able to merge dependent changes
Time has come to switch PTI development over to a v4.15 base - we'll still
try to make sure that all PTI fixes backport cleanly to v4.14 and earlier.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86/pti updates from Thomas Gleixner:
"Another set of melted spectrum related changes:
- Code simplifications and cleanups for RSB and retpolines.
- Make the indirect calls in KVM speculation safe.
- Whitelist CPUs which are known not to speculate from Meltdown and
prepare for the new CPUID flag which tells the kernel that a CPU is
not affected.
- A less rigorous variant of the module retpoline check which merily
warns when a non-retpoline protected module is loaded and reflects
that fact in the sysfs file.
- Prepare for Indirect Branch Prediction Barrier support.
- Prepare for exposure of the Speculation Control MSRs to guests, so
guest OSes which depend on those "features" can use them. Includes
a blacklist of the broken microcodes. The actual exposure of the
MSRs through KVM is still being worked on"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Simplify indirect_branch_prediction_barrier()
x86/retpoline: Simplify vmexit_fill_RSB()
x86/cpufeatures: Clean up Spectre v2 related CPUID flags
x86/cpu/bugs: Make retpoline module warning conditional
x86/bugs: Drop one "mitigation" from dmesg
x86/nospec: Fix header guards names
x86/alternative: Print unadorned pointers
x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
x86/msr: Add definitions for new speculation control MSRs
x86/cpufeatures: Add AMD feature bits for Speculation Control
x86/cpufeatures: Add Intel feature bits for Speculation Control
x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
module/retpoline: Warn about missing retpoline in module
KVM: VMX: Make indirect call speculation safe
KVM: x86: Make indirect calls in emulator speculation safe
Pull x86 retpoline fixlet from Thomas Gleixner:
"Remove the ESP/RSP thunks for retpoline as they cannot ever work.
Get rid of them before they show up in a release"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retpoline: Remove the esp/rsp thunk
It doesn't make sense to have an indirect call thunk with esp/rsp as
retpoline code won't work correctly with the stack pointer register.
Removing it will help compiler writers to catch error in case such
a thunk call is emitted incorrectly.
Fixes: 76b043848f ("x86/retpoline: Add initial retpoline support")
Suggested-by: Jeff Law <law@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Kees Cook <keescook@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1516658974-27852-1-git-send-email-longman@redhat.com
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.
The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 pti fixes from Thomas Gleixner:
"A small set of fixes for the meltdown/spectre mitigations:
- Make kprobes aware of retpolines to prevent probes in the retpoline
thunks.
- Make the machine check exception speculation protected. MCE used to
issue an indirect call directly from the ASM entry code. Convert
that to a direct call into a C-function and issue the indirect call
from there so the compiler can add the retpoline protection,
- Make the vmexit_fill_RSB() assembly less stupid
- Fix a typo in the PTI documentation"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/retpoline: Optimize inline assembler for vmexit_fill_RSB
x86/pti: Document fix wrong index
kprobes/x86: Disable optimizing on the function jumps to indirect thunk
kprobes/x86: Blacklist indirect thunk functions for kprobes
retpoline: Introduce start/end markers of indirect thunk
x86/mce: Make machine check speculation protected
Mark __x86_indirect_thunk_* functions as blacklist for kprobes
because those functions can be called from anywhere in the kernel
including blacklist functions of kprobes.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/151629209111.10241.5444852823378068683.stgit@devbox
Introduce start/end markers of __x86_indirect_thunk_* functions.
To make it easy, consolidate .text.__x86.indirect_thunk.* sections
to one .text.__x86.indirect_thunk section and put it in the
end of kernel text section and adds __indirect_thunk_start/end
so that other subsystem (e.g. kprobes) can identify it.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/151629206178.10241.6828804696410044771.stgit@devbox
__const_udelay is marked inline, and LTO will happily inline it everywhere
Dropping the inline saves ~44k text in a LTO build.
13999560 1740864 1499136 17239560 1070e08 vmlinux-with-udelay-inline
13954764 1736768 1499136 17190668 1064f0c vmlinux-wo-udelay-inline
Inlining it has no advantage in general, so its the right thing to do.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20171222001821.2157-2-andi@firstfloor.org
Pull x86 pti updates from Thomas Gleixner:
"This contains:
- a PTI bugfix to avoid setting reserved CR3 bits when PCID is
disabled. This seems to cause issues on a virtual machine at least
and is incorrect according to the AMD manual.
- a PTI bugfix which disables the perf BTS facility if PTI is
enabled. The BTS AUX buffer is not globally visible and causes the
CPU to fault when the mapping disappears on switching CR3 to user
space. A full fix which restores BTS on PTI is non trivial and will
be worked on.
- PTI bugfixes for EFI and trusted boot which make sure that the user
space visible page table entries have the NX bit cleared
- removal of dead code in the PTI pagetable setup functions
- add PTI documentation
- add a selftest for vsyscall to verify that the kernel actually
implements what it advertises.
- a sysfs interface to expose vulnerability and mitigation
information so there is a coherent way for users to retrieve the
status.
- the initial spectre_v2 mitigations, aka retpoline:
+ The necessary ASM thunk and compiler support
+ The ASM variants of retpoline and the conversion of affected ASM
code
+ Make LFENCE serializing on AMD so it can be used as speculation
trap
+ The RSB fill after vmexit
- initial objtool support for retpoline
As I said in the status mail this is the most of the set of patches
which should go into 4.15 except two straight forward patches still on
hold:
- the retpoline add on of LFENCE which waits for ACKs
- the RSB fill after context switch
Both should be ready to go early next week and with that we'll have
covered the major holes of spectre_v2 and go back to normality"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
x86,perf: Disable intel_bts when PTI
security/Kconfig: Correct the Documentation reference for PTI
x86/pti: Fix !PCID and sanitize defines
selftests/x86: Add test_vsyscall
x86/retpoline: Fill return stack buffer on vmexit
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline: Add initial retpoline support
objtool: Allow alternatives to be ignored
objtool: Detect jumps to retpoline thunks
x86/pti: Make unpoison of pgd for trusted boot work for real
x86/alternatives: Fix optimize_nops() checking
sysfs/cpu: Fix typos in vulnerability documentation
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
...
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Enable the use of -mindirect-branch=thunk-extern in newer GCC, and provide
the corresponding thunks. Provide assembler macros for invoking the thunks
in the same way that GCC does, from native and inline assembler.
This adds X86_FEATURE_RETPOLINE and sets it by default on all CPUs. In
some circumstances, IBRS microcode features may be used instead, and the
retpoline can be disabled.
On AMD CPUs if lfence is serialising, the retpoline can be dramatically
simplified to a simple "lfence; jmp *\reg". A future patch, after it has
been verified that lfence really is serialising in all circumstances, can
enable this by setting the X86_FEATURE_RETPOLINE_AMD feature bit in addition
to X86_FEATURE_RETPOLINE.
Do not align the retpoline in the altinstr section, because there is no
guarantee that it stays aligned when it's copied over the oldinstr during
alternative patching.
[ Andi Kleen: Rename the macros, add CONFIG_RETPOLINE option, export thunks]
[ tglx: Put actual function CALL/JMP in front of the macros, convert to
symbolic labels ]
[ dwmw2: Convert back to numeric labels, merge objtool fixes ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515707194-20531-4-git-send-email-dwmw@amazon.co.uk
Pull x86 syscall entry code changes for PTI from Ingo Molnar:
"The main changes here are Andy Lutomirski's changes to switch the
x86-64 entry code to use the 'per CPU entry trampoline stack'. This,
besides helping fix KASLR leaks (the pending Page Table Isolation
(PTI) work), also robustifies the x86 entry code"
* 'WIP.x86-pti.entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/cpufeatures: Make CPU bugs sticky
x86/paravirt: Provide a way to check for hypervisors
x86/paravirt: Dont patch flush_tlb_single
x86/entry/64: Make cpu_entry_area.tss read-only
x86/entry: Clean up the SYSENTER_stack code
x86/entry/64: Remove the SYSENTER stack canary
x86/entry/64: Move the IST stacks into struct cpu_entry_area
x86/entry/64: Create a per-CPU SYSCALL entry trampoline
x86/entry/64: Return to userspace from the trampoline stack
x86/entry/64: Use a per-CPU trampoline stack for IDT entries
x86/espfix/64: Stop assuming that pt_regs is on the entry stack
x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
x86/entry: Remap the TSS into the CPU entry area
x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
x86/dumpstack: Handle stack overflow on all stacks
x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
x86/kasan/64: Teach KASAN about the cpu_entry_area
x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area
x86/entry/gdt: Put per-CPU GDT remaps in ascending order
x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
...
The TSS is a fairly juicy target for exploits, and, now that the TSS
is in the cpu_entry_area, it's no longer protected by kASLR. Make it
read-only on x86_64.
On x86_32, it can't be RO because it's written by the CPU during task
switches, and we use a task gate for double faults. I'd also be
nervous about errata if we tried to make it RO even on configurations
without double fault handling.
[ tglx: AMD confirmed that there is no problem on 64-bit with TSS RO. So
it's probably safe to assume that it's a non issue, though Intel
might have been creative in that area. Still waiting for
confirmation. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.733700132@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update x86-opcode-map.txt based on the October 2017 Intel SDM publication.
Fix INVPID to INVVPID.
Add UD0 and UD1 instruction opcodes.
Also sync the objtool and perf tooling copies of this file.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aac062d7-c0f6-96e3-5c92-ed299e2bd3da@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kbuild test robot reported this build warning:
Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c
Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
Warning: objdump says 3 bytes, but insn_get_length() says 2
Warning: decoded and checked 1569014 instructions with 1 warnings
This sequence seems to be a new instruction not in the opcode map in the Intel SDM.
The instruction sequence is "F6 09 d8", means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.
Intel SDM vol2 A.4 Table A-6 said the table index in the group is "Encoding of Bits 5,4,3 of
the ModR/M Byte (bits 2,1,0 in parenthesis)"
In that table, opcodes listed by the index REG bits as:
000 001 010 011 100 101 110 111
TEST Ib/Iz,(undefined),NOT,NEG,MUL AL/rAX,IMUL AL/rAX,DIV AL/rAX,IDIV AL/rAX
So, it seems TEST Ib is assigned to 001.
Add the new pattern.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to save on redundant structs definitions
insn_get_code_seg_params() was made to return two 4-bit values in a char
but clang complains:
arch/x86/lib/insn-eval.c:780:10: warning: implicit conversion from 'int' to 'char'
changes value from 132 to -124 [-Wconstant-conversion]
return INSN_CODE_SEG_PARAMS(4, 8);
~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/x86/include/asm/insn-eval.h:16:57: note: expanded from macro 'INSN_CODE_SEG_PARAMS'
#define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))
Those two values do get picked apart afterwards the opposite way of how
they were ORed so wrt to the LSByte, the return value is the same.
But this function returns -EINVAL in the error case, which is an int. So
make it return an int which is the native word size anyway and thus fix
the clang warning.
Reported-by: Kees Cook <keescook@google.com>
Reported-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ricardo.neri-calderon@linux.intel.com
Link: https://lkml.kernel.org/r/20171123091951.1462-1-bp@alien8.de
Pull x86 core updates from Ingo Molnar:
"Note that in this cycle most of the x86 topics interacted at a level
that caused them to be merged into tip:x86/asm - but this should be a
temporary phenomenon, hopefully we'll back to the usual patterns in
the next merge window.
The main changes in this cycle were:
Hardware enablement:
- Add support for the Intel UMIP (User Mode Instruction Prevention)
CPU feature. This is a security feature that disables certain
instructions such as SGDT, SLDT, SIDT, SMSW and STR. (Ricardo Neri)
[ Note that this is disabled by default for now, there are some
smaller enhancements in the pipeline that I'll follow up with in
the next 1-2 days, which allows this to be enabled by default.]
- Add support for the AMD SEV (Secure Encrypted Virtualization) CPU
feature, on top of SME (Secure Memory Encryption) support that was
added in v4.14. (Tom Lendacky, Brijesh Singh)
- Enable new SSE/AVX/AVX512 CPU features: AVX512_VBMI2, GFNI, VAES,
VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. (Gayatri Kammela)
Other changes:
- A big series of entry code simplifications and enhancements (Andy
Lutomirski)
- Make the ORC unwinder default on x86 and various objtool
enhancements. (Josh Poimboeuf)
- 5-level paging enhancements (Kirill A. Shutemov)
- Micro-optimize the entry code a bit (Borislav Petkov)
- Improve the handling of interdependent CPU features in the early
FPU init code (Andi Kleen)
- Build system enhancements (Changbin Du, Masahiro Yamada)
- ... plus misc enhancements, fixes and cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
x86/build: Make the boot image generation less verbose
selftests/x86: Add tests for the STR and SLDT instructions
selftests/x86: Add tests for User-Mode Instruction Prevention
x86/traps: Fix up general protection faults caused by UMIP
x86/umip: Enable User-Mode Instruction Prevention at runtime
x86/umip: Force a page fault when unable to copy emulated result to user
x86/umip: Add emulation code for UMIP instructions
x86/cpufeature: Add User-Mode Instruction Prevention definitions
x86/insn-eval: Add support to resolve 16-bit address encodings
x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
x86/insn-eval: Add wrapper function for 32 and 64-bit addresses
x86/insn-eval: Add support to resolve 32-bit address encodings
x86/insn-eval: Compute linear address in several utility functions
resource: Fix resource_size.cocci warnings
X86/KVM: Clear encryption attribute when SEV is active
X86/KVM: Decrypt shared per-cpu variables when SEV is active
percpu: Introduce DEFINE_PER_CPU_DECRYPTED
x86: Add support for changing memory encryption attribute in early boot
x86/io: Unroll string I/O when SEV is active
x86/boot: Add early boot support when running with SEV active
...
Tasks running in virtual-8086 mode, in protected mode with code segment
descriptors that specify 16-bit default address sizes via the D bit, or via
an address override prefix will use 16-bit addressing form encodings as
described in the Intel 64 and IA-32 Architecture Software Developer's
Manual Volume 2A Section 2.1.5, Table 2-1.
16-bit addressing encodings differ in several ways from the 32-bit/64-bit
addressing form encodings: ModRM.rm points to different registers and, in
some cases, effective addresses are indicated by the addition of the value
of two registers. Also, there is no support for SIB bytes. Thus, a
separate function is needed to parse this form of addressing.
Three functions are introduced. get_reg_offset_16() obtains the
offset from the base of pt_regs of the registers indicated by the ModRM
byte of the address encoding. get_eff_addr_modrm_16() computes the
effective address from the value of the register operands.
get_addr_ref_16() computes the linear address using the obtained effective
address and the base address of the segment.
Segment limits are enforced when running in protected mode.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-6-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is possible to utilize 32-bit address encodings in virtual-8086 mode via
an address override instruction prefix. However, the range of the
effective address is still limited to [0x-0xffff]. In such a case, return
error.
Also, linear addresses in virtual-8086 mode are limited to 20 bits. Enforce
such limit by truncating the most significant bytes of the computed linear
address.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-5-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function insn_get_addr_ref() is capable of handling only 64-bit
addresses. A previous commit introduced a function to handle 32-bit
addresses. Invoke these two functions from a third wrapper function that
calls the appropriate routine based on the address size specified in the
instruction structure (obtained by looking at the code segment default
address size and the address override prefix, if present).
While doing this, rename the original function insn_get_addr_ref() with
the more appropriate name get_addr_ref_64(), ensure it is only used
for 64-bit addresses.
Also, since 64-bit addresses are not possible in 32-bit builds, provide
a dummy function such case.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-4-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
32-bit and 64-bit address encodings are identical. Thus, the same logic
could be used to resolve the effective address. However, there are two key
differences: address size and enforcement of segment limits.
If running a 32-bit process on a 64-bit kernel, it is best to perform
the address calculation using 32-bit data types. In this manner hardware
is used for the arithmetic, including handling of signs and overflows.
32-bit addresses are generally used in protected mode; segment limits are
enforced in this mode. This implementation obtains the limit of the
segment associated with the instruction operands and prefixes. If the
computed address is outside the segment limits, an error is returned. It
is also possible to use 32-bit address in long mode and virtual-8086 mode
by using an address override prefix. In such cases, segment limits are not
enforced.
Support to use 32-bit arithmetic is added to the utility functions that
compute effective addresses. However, the end result is stored in a
variable of type long (which has a width of 8 bytes in 64-bit builds).
Hence, once a 32-bit effective address is computed, the 4 most significant
bytes are masked out to avoid sign extension.
The newly added function get_addr_ref_32() is almost identical to the
existing function insn_get_addr_ref() (used for 64-bit addresses). The only
difference is that it verifies that the effective address is within the
limits of the segment.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-3-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Computing a linear address involves several steps. The first step is to
compute the effective address. This requires determining the addressing
mode in use and perform arithmetic operations on the operands. Plus, each
addressing mode has special cases that must be handled.
Once the effective address is known, the base address of the applicable
segment is added to obtain the linear address.
Clearly, this is too much work for a single function. Instead, handle each
addressing mode in a separate utility function. This improves readability
and gives us the opportunity to handler errors better.
At the moment, arithmetic to compute the effective address uses 64-byte
variables. Thus, limit support to 64-bit addresses.
While reworking the function insn_get_addr_ref(), the variable addr_offset
is renamed as regoff to reflect its actual use (i.e., offset, from the
base of pt_regs, of the register used as operand).
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-2-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In protected mode, it is common to want to obtain the limit of a segment
along with its base address. This is useful, for instance, to verify that
an effective address lies within a segment before computing a linear
address.
Up to this point, this library only computes linear addresses in long
mode. Subsequent patches will include support for protected mode. Support
to verify the segment limit will be needed.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509148310-30862-2-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
insn_get_addr_ref() returns the effective address as defined by the
section 3.7.5.1 Vol 1 of the Intel 64 and IA-32 Architectures Software
Developer's Manual. In order to compute the linear address, we must add
to the effective address the segment base address as set in the segment
descriptor. The segment descriptor to use depends on the register used as
operand and segment override prefixes, if any.
In most cases, the segment base address will be 0 if the USER_DS/USER32_DS
segment is used or if segmentation is not used. However, the base address
is not necessarily zero if a user programs defines its own segments. This
is possible by using a local descriptor table.
Since the effective address is a signed quantity, the unsigned segment
base address is saved in a separate variable and added to the final,
unsigned, effective address.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-19-git-send-email-ricardo.neri-calderon@linux.intel.com
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software
Developer's Manual volume 2A states that when ModRM.mod is zero and
ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means
that none of the registers are used in the computation of the effective
address. A return value of -EDOM indicates callers that they should not
use the value of registers when computing the effective address for the
instruction.
In long mode, the effective address is given by the 32-bit displacement
plus the location of the next instruction. In protected mode, only the
displacement is used.
The instruction decoder takes care of obtaining the displacement.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-18-git-send-email-ricardo.neri-calderon@linux.intel.com
Obtain the default values of the address and operand sizes as specified in
the D and L bits of the the segment descriptor selected by the register
CS. The function can be used for both protected and long modes.
For virtual-8086 mode, the default address and operand sizes are always 2
bytes.
The returned parameters are encoded in a signed 8-bit data type. Auxiliar
macros are provided to encode and decode such values.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-17-git-send-email-ricardo.neri-calderon@linux.intel.com
With segmentation, the base address of the segment is needed to compute a
linear address. This base address is obtained from the applicable segment
descriptor. Such segment descriptor is referenced from a segment selector.
These new functions obtain the segment base and limit of the segment
selector indicated by segment register index given as argument. This index
is any of the INAT_SEG_REG_* family of #define's.
The logic to obtain the segment selector is wrapped in the function
get_segment_selector() with the inputs described above. Once the selector
is known, the base address is determined. In protected mode, the selector
is used to obtain the segment descriptor and then its base address. In
long mode, the segment base address is zero except when FS or GS are used.
In virtual-8086 mode, the base address is computed as the value of the
segment selector shifted 4 positions to the left.
In protected mode, segment limits are enforced. Thus, a function to
determine the limit of the segment is added. Segment limits are not
enforced in long or virtual-8086. For the latter, addresses are limited
to 20 bits; address size will be handled when computing the linear
address.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-16-git-send-email-ricardo.neri-calderon@linux.intel.com
The segment descriptor contains information that is relevant to how linear
addresses need to be computed. It contains the default size of addresses
as well as the base address of the segment. Thus, given a segment
selector, we ought to look at segment descriptor to correctly calculate
the linear address.
In protected mode, the segment selector might indicate a segment
descriptor from either the global descriptor table or a local descriptor
table. Both cases are considered in this function.
This function is a prerequisite for functions in subsequent commits that
will obtain the aforementioned attributes of the segment descriptor.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-15-git-send-email-ricardo.neri-calderon@linux.intel.com
When computing a linear address and segmentation is used, we need to know
the base address of the segment involved in the computation. In most of
the cases, the segment base address will be zero as in USER_DS/USER32_DS.
However, it may be possible that a user space program defines its own
segments via a local descriptor table. In such a case, the segment base
address may not be zero. Thus, the segment base address is needed to
calculate correctly the linear address.
If running in protected mode, the segment selector to be used when
computing a linear address is determined by either any of segment override
prefixes in the instruction or inferred from the registers involved in the
computation of the effective address; in that order. Also, there are cases
when the segment override prefixes shall be ignored (i.e., code segments
are always selected by the CS segment register; string instructions always
use the ES segment register when using rDI register as operand). In long
mode, segment registers are ignored, except for FS and GS. In these two
cases, base addresses are obtained from the respective MSRs.
For clarity, this process can be split into four steps (and an equal
number of functions): determine if segment prefixes overrides can be used;
parse the segment override prefixes, and use them if found; if not found
or cannot be used, use the default segment registers associated with the
operand registers. Once the segment register to use has been identified,
read its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In
32-bit builds, segment selectors are saved into a pt_regs structure
when switching to kernel mode. The same is also true for virtual-8086
mode. In 64-bit builds, segmentation is mostly ignored, except when
running a program in 32-bit legacy mode. In this case, CS and SS can be
obtained from pt_regs. DS, ES, FS and GS can be read directly from
the respective segment registers.
In order to identify the segment registers, a new set of #defines is
introduced. It also includes two special identifiers. One of them
indicates when the default segment register associated with instruction
operands shall be used. Another one indicates that the contents of the
segment register shall be ignored; this identifier is used when in long
mode.
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-14-git-send-email-ricardo.neri-calderon@linux.intel.com
String instructions are special because, in protected mode, the linear
address is always obtained via the ES segment register in operands that
use the (E)DI register; the DS segment register in operands that use
the (E)SI register. Furthermore, segment override prefixes are ignored
when calculating a linear address involving the (E)DI register; segment
override prefixes can be used when calculating linear addresses involving
the (E)SI register.
It follows that linear addresses are calculated differently for the case of
string instructions. The purpose of this utility function is to identify
such instructions for callers to determine a linear address correctly.
Note that this function only identifies string instructions; it does not
determine what segment register to use in the address computation. That is
left to callers. A subsequent commmit introduces a function to determine
the segment register to use given the instruction, operands and
segment override prefixes.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-13-git-send-email-ricardo.neri-calderon@linux.intel.com
The function get_reg_offset() returns the offset to the register the
argument specifies as indicated in an enumeration of type offset. Callers
of this function would need the definition of such enumeration. This is
not needed. Instead, add helper functions for this purpose. These functions
are useful in cases when, for instance, the caller needs to decide whether
the operand is a register or a memory location by looking at the rm part
of the ModRM byte. As of now, this is the only helper function that is
needed.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-12-git-send-email-ricardo.neri-calderon@linux.intel.com
We are not in a critical failure path. The invalid register type is caused
when trying to decode invalid instruction bytes from a user-space program.
Thus, simply print an error message. To prevent this warning from being
abused from user space programs, use the rate-limited variant of pr_err().
along with a descriptive prefix.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-11-git-send-email-ricardo.neri-calderon@linux.intel.com
Other kernel submodules can benefit from using the utility functions
defined in mpx.c to obtain the addresses and values of operands contained
in the general purpose registers. An instance of this is the emulation code
used for instructions protected by the Intel User-Mode Instruction
Prevention feature.
Thus, these functions are relocated to a new insn-eval.c file. The reason
to not relocate these utilities into insn.c is that the latter solely
analyses instructions given by a struct insn without any knowledge of the
meaning of the values of instruction operands. This new utility insn-
eval.c aims to be used to resolve userspace linear addresses based on
the contents of the instruction operands as well as the contents of pt_regs
structure.
These utilities come with a separate header. This is to avoid taking insn.c
out of sync from the instructions decoders under tools/obj and tools/perf.
This also avoids adding cumbersome #ifdef's for the #include'd files
required to decode instructions in a kernel context.
Functions are simply relocated. There are not functional or indentation
changes.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: ricardo.neri@intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/1509135945-13762-10-git-send-email-ricardo.neri-calderon@linux.intel.com
Add a cmdline_find_option() function to look for cmdline options that
take arguments. The argument is returned in a supplied buffer and the
argument length (regardless of whether it fits in the supplied buffer)
is returned, with -1 indicating not found.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This adds support for compiling with a rough equivalent to the glibc
_FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
overflow checks for string.h functions when the compiler determines the
size of the source or destination buffer at compile-time. Unlike glibc,
it covers buffer reads in addition to writes.
GNU C __builtin_*_chk intrinsics are avoided because they would force a
much more complex implementation. They aren't designed to detect read
overflows and offer no real benefit when using an implementation based
on inline checks. Inline checks don't add up to much code size and
allow full use of the regular string intrinsics while avoiding the need
for a bunch of _chk functions and per-arch assembly to avoid wrapper
overhead.
This detects various overflows at compile-time in various drivers and
some non-x86 core kernel code. There will likely be issues caught in
regular use at runtime too.
Future improvements left out of initial implementation for simplicity,
as it's all quite optional and can be done incrementally:
* Some of the fortified string functions (strncpy, strcat), don't yet
place a limit on reads from the source based on __builtin_object_size of
the source buffer.
* Extending coverage to more string functions like strlcat.
* It should be possible to optionally use __builtin_object_size(x, 1) for
some functions (C strings) to detect intra-object overflows (like
glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
approach to avoid likely compatibility issues.
* The compile-time checks should be made available via a separate config
option which can be enabled by default (or always enabled) once enough
time has passed to get the issues it catches fixed.
Kees said:
"This is great to have. While it was out-of-tree code, it would have
blocked at least CVE-2016-3858 from being exploitable (improper size
argument to strlcpy()). I've sent a number of fixes for
out-of-bounds-reads that this detected upstream already"
[arnd@arndb.de: x86: fix fortified memcpy]
Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
[keescook@chromium.org: avoid panic() in favor of BUG()]
Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
[keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Introduce the _flushcache() family of memory copy helpers and use them
for persistent memory write operations on x86. The _flushcache()
semantic indicates that the cache is either bypassed for the copy
operation (movnt) or any lines dirtied by the copy operation are
written back (clwb, clflushopt, or clflush).
* Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
* Add support for the new NVDIMM platform/firmware mechanisms introduced
in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace
label format, extensions to the address-range-scrub command set, new
error injection commands, and a new BTT (block-translation-table)
layout. These updates support inter-OS and pre-OS compatibility.
* Fix a longstanding memory corruption bug in nfit_test.
* Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
* Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed:
commit 6aa734a2f3 "libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime"
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e
mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X
rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7
5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT
WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks
rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ
btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i
2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6
MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7
JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg
BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ
pMHuMAqIcIyLhIe2/sRF
=xKQB
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"libnvdimm updates for the latest ACPI and UEFI specifications. This
pull request also includes new 'struct dax_operations' enabling to
undo the abuse of copy_user_nocache() for copy operations to pmem.
The dax work originally missed 4.12 to address concerns raised by Al.
Summary:
- Introduce the _flushcache() family of memory copy helpers and use
them for persistent memory write operations on x86. The
_flushcache() semantic indicates that the cache is either bypassed
for the copy operation (movnt) or any lines dirtied by the copy
operation are written back (clwb, clflushopt, or clflush).
- Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
- Add support for the new NVDIMM platform/firmware mechanisms
introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
namespace label format, extensions to the address-range-scrub
command set, new error injection commands, and a new BTT
(block-translation-table) layout. These updates support inter-OS
and pre-OS compatibility.
- Fix a longstanding memory corruption bug in nfit_test.
- Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
- Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed: commit
6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
<toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
libnvdimm, namespace: record 'lbasize' for pmem namespaces
acpi/nfit: Issue Start ARS to retrieve existing records
libnvdimm: New ACPI 6.2 DSM functions
acpi, nfit: Show bus_dsm_mask in sysfs
libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
acpi, nfit: Enable DSM pass thru for root functions.
libnvdimm: passthru functions clear to send
libnvdimm, btt: convert some info messages to warn/err
libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
libnvdimm: fix the clear-error check in nsio_rw_bytes
libnvdimm, btt: fix btt_rw_page not returning errors
acpi, nfit: quiet invalid block-aperture-region warnings
libnvdimm, btt: BTT updates for UEFI 2.7 format
acpi, nfit: constify *_attribute_group
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem, dax: export a cache control attribute
dax: convert to bitmask for flags
dax: remove default copy_from_iter fallback
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: fix persistence warning
...
Pull x86 asm updates from Ingo Molnar:
"A single commit micro-optimizing short user copies on certain Intel
CPUs"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
Pull perf updates from Ingo Molnar:
"Most of the changes are for tooling, the main changes in this cycle were:
- Improve Intel-PT hardware tracing support, both on the kernel and
on the tooling side: PTWRITE instruction support, power events for
C-state tracing, etc. (Adrian Hunter)
- Add support to measure SMI cost to the x86 architecture, with
tooling support in 'perf stat' (Kan Liang)
- Support function filtering in 'perf ftrace', plus related
improvements (Namhyung Kim)
- Allow adding and removing fields to the default 'perf script'
columns, using + or - as field prefixes to do so (Andi Kleen)
- Allow resolving the DSO name with 'perf script -F brstack{sym,off},dso'
(Mark Santaniello)
- Add perf tooling unwind support for PowerPC (Paolo Bonzini)
- ... and various other improvements as well"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
perf auxtrace: Add CPU filter support
perf intel-pt: Do not use TSC packets for calculating CPU cycles to TSC
perf intel-pt: Update documentation to include new ptwrite and power events
perf intel-pt: Add example script for power events and PTWRITE
perf intel-pt: Synthesize new power and "ptwrite" events
perf intel-pt: Move code in intel_pt_synth_events() to simplify attr setting
perf intel-pt: Factor out intel_pt_set_event_name()
perf intel-pt: Tidy messages into called function intel_pt_synth_event()
perf intel-pt: Tidy Intel PT evsel lookup into separate function
perf intel-pt: Join needlessly wrapped lines
perf intel-pt: Remove unused instructions_sample_period
perf intel-pt: Factor out common code synthesizing event samples
perf script: Add synthesized Intel PT power and ptwrite events
perf/x86/intel: Constify the 'lbr_desc[]' array and make a function static
perf script: Add 'synth' field for synthesized event payloads
perf auxtrace: Add itrace option to output power events
perf auxtrace: Add itrace option to output ptwrite events
tools include: Add byte-swapping macros to kernel.h
perf script: Add 'synth' event type for synthesized events
x86/insn: perf tools: Add new ptwrite instruction
...
In preparation for an objtool rewrite which will have broader checks,
whitelist functions and files which cause problems because they do
unusual things with the stack.
These whitelists serve as a TODO list for which functions and files
don't yet have undwarf unwinder coverage. Eventually most of the
whitelists can be removed in favor of manual CFI hint annotations or
objtool improvements.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
According to the Intel datasheet, the REP MOVSB instruction
exposes a pretty heavy setup cost (50 ticks), which hurts
short string copy operations.
This change tries to avoid this cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes.
The 64 bytes cutoff value is arbitrary from the code logic
point of view - it has been selected based on measurements,
as the largest value that still ensures a measurable gain.
Micro benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(shorter the string, larger the gain):
- in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
- in the [72%-9%] range on Intel Core i7-4810MQ
Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no difference, because they do not expose the
ERMS feature bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4533a1d101fd460f80e21329a34928fad521c1d4.1498744345.git.pabeni@redhat.com
[ Clarified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add ptwrite to the op code map and the perf tools new instructions test.
To run the test:
$ tools/perf/perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ tools/perf/perf test -v "x86 ins" 2>&1 | grep ptwrite
For information about ptwrite, refer the Intel SDM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: http://lkml.kernel.org/r/1495180230-19367-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to
libnvdimm and the pmem driver directly we do not need to provide global
wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem
driver will simply not link to arch_wb_cache_pmem() in that case. Same
as before, pmem flushing is only defined for x86_64, via
clean_cache_range(), but it is straightforward to add other archs in the
future.
arch_wb_cache_pmem() is an exported function since the pmem module needs
to find it, but it is privately declared in drivers/nvdimm/pmem.h because
there are no consumers outside of the pmem driver.
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes are not
cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
(non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
to ensure data-writes have reached a power-fail-safe zone in the platform. The
fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
around and fence previous writes with an "sfence".
Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
memcpy_flushcache, that guarantee that the destination buffer is not dirty in
the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
will be used to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
This is meant to satisfy the concern from Linus that if a driver wants to do
something beyond the normal nocache semantics it should be something private to
that driver [1], and Al's concern that anything uaccess related belongs with
the rest of the uaccess code [2].
The first consumer of this interface is a new 'copy_from_iter' dax operation so
that pmem can inject cache maintenance operations without imposing this
overhead on other dax-capable drivers.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The constraint "rm" allows the compiler to put mix_const into memory.
When the input operand is a memory location then MUL needs an operand
size suffix, since Clang can't infer the multiplication width from the
operand.
Add and use the _ASM_MUL macro which determines the operand size and
resolves to the NUL instruction with the corresponding suffix.
This fixes the following error when building with clang:
CC arch/x86/lib/kaslr.o
/tmp/kaslr-dfe1ad.s: Assembler messages:
/tmp/kaslr-dfe1ad.s:182: Error: no instruction mnemonic suffix given and no register operands; can't size instruction
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Cc: Grant Grundler <grundler@chromium.org>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Davidson <md@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170501224741.133938-1-mka@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:
WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0
The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().
That function saves RBP on the stack and then overwrites it, using it as
a scratch register. That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.
Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- unwinder fixes and enhancements
- improve ftrace interaction with the unwinder
- optimize the code footprint of WARN() and related debugging
constructs
- ... plus misc updates, cleanups and fixes"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/unwind: Dump all stacks in unwind_dump()
x86/unwind: Silence more entry-code related warnings
x86/ftrace: Fix ebp in ftrace_regs_caller that screws up unwinder
x86/unwind: Remove unused 'sp' parameter in unwind_dump()
x86/unwind: Prepend hex mask value with '0x' in unwind_dump()
x86/unwind: Properly zero-pad 32-bit values in unwind_dump()
x86/unwind: Ensure stack pointer is aligned
debug: Avoid setting BUGFLAG_WARNING twice
x86/unwind: Silence entry-related warnings
x86/unwind: Read stack return address in update_stack_state()
x86/unwind: Move common code into update_stack_state()
debug: Fix __bug_table[] in arch linker scripts
debug: Add _ONCE() logic to report_bug()
x86/debug: Define BUG() again for !CONFIG_BUG
x86/debug: Implement __WARN() using UD0
x86/ftrace: Use Makefile logic instead of #ifdef for compiling ftrace_*.o
x86/ftrace: Add -mfentry support to x86_32 with DYNAMIC_FTRACE set
x86/ftrace: Clean up ftrace_regs_caller
x86/ftrace: Add stack frame pointer to ftrace_caller
x86/ftrace: Move the ftrace specific code out of entry_32.S
...
Pull x86 boot updates from Ingo Molnar:
"The biggest changes in this cycle were:
- reworking of the e820 code: separate in-kernel and boot-ABI data
structures and apply a whole range of cleanups to the kernel side.
No change in functionality.
- enable KASLR by default: it's used by all major distros and it's
out of the experimental stage as well.
- ... misc fixes and cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails
x86/reboot: Turn off KVM when halting a CPU
x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
x86: Enable KASLR by default
boot/param: Move next_arg() function to lib/cmdline.c for later reuse
x86/boot: Fix Sparse warning by including required header file
x86/boot/64: Rename start_cpu()
x86/xen: Update e820 table handling to the new core x86 E820 code
x86/boot: Fix pr_debug() API braindamage
xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h>
x86/boot/e820: Simplify e820__update_table()
x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures
x86/boot/e820: Fix and clean up e820_type switch() statements
x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix
x86/boot/e820: Remove unnecessary #include's
x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions()
x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*()
x86/boot/e820: Use bool in query APIs
x86/boot/e820: Document e820__reserve_setup_data()
x86/boot/e820: Clean up __e820__update_table() et al
...
Pull uaccess unification updates from Al Viro:
"This is the uaccess unification pile. It's _not_ the end of uaccess
work, but the next batch of that will go into the next cycle. This one
mostly takes copy_from_user() and friends out of arch/* and gets the
zero-padding behaviour in sync for all architectures.
Dealing with the nocache/writethrough mess is for the next cycle;
fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
sold on access_ok() in there, BTW; just not in this pile), same for
reducing __copy_... callsites, strn*... stuff, etc. - there will be a
pile about as large as this one in the next merge window.
This one sat in -next for weeks. -3KLoC"
* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
HAVE_ARCH_HARDENED_USERCOPY is unconditional now
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
m32r: switch to RAW_COPY_USER
hexagon: switch to RAW_COPY_USER
microblaze: switch to RAW_COPY_USER
get rid of padding, switch to RAW_COPY_USER
ia64: get rid of copy_in_user()
ia64: sanitize __access_ok()
ia64: get rid of 'segment' argument of __do_{get,put}_user()
ia64: get rid of 'segment' argument of __{get,put}_user_check()
ia64: add extable.h
powerpc: get rid of zeroing, switch to RAW_COPY_USER
esas2r: don't open-code memdup_user()
alpha: fix stack smashing in old_adjtimex(2)
don't open-code kernel_setsockopt()
mips: switch to RAW_COPY_USER
mips: get rid of tail-zeroing in primitives
mips: make copy_from_user() zero tail explicitly
mips: clean and reorder the forest of macros...
mips: consolidate __invoke_... wrappers
...
Newer hardware has uncovered a bug in the software implementation of
using MWAITX for the delay function. A value of 0 for the timer is meant
to indicate that a timeout will not be used to exit MWAITX. On newer
hardware this can result in MWAITX never returning, resulting in NMI
soft lockup messages being printed. On older hardware, some of the other
conditions under which MWAITX can exit masked this issue. The AMD APM
does not currently document this and will be updated.
Please refer to http://marc.info/?l=kvm&m=148950623231140 for
information regarding NMI soft lockup messages on an AMD Ryzen 1800X.
This has been root-caused as a 0 passed to MWAITX causing it to wait
indefinitely.
This change has the added benefit of avoiding the unnecessary setup of
MONITORX/MWAITX when the delay value is zero.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Link: http://lkml.kernel.org/r/1493156643-29366-1-git-send-email-Janakarajan.Natarajan@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Back in commit:
92b0729c34 ("x86/mm, x86/mce: Add memcpy_mcsafe()")
... I made a copy/paste error setting up the exception table entries
and ended up with two for label .L_cache_w3 and none for .L_cache_w2.
This means that if we take a machine check on:
.L_cache_w2: movq 2*8(%rsi), %r10
then we don't have an exception table entry for this instruction
and we can't recover.
Fix: s/3/2/
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 92b0729c34 ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Link: http://lkml.kernel.org/r/1490046030-25862-1-git-send-email-tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In line with asm/e820/types.h, move the e820 API declarations to
asm/e820/api.h and update all usage sites.
This is just a mechanical, obviously correct move & replace patch,
there will be subsequent changes to clean up the code and to make
better use of the new header organization.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a panic happens during bootup, "Rebooting in X seconds.." is
shown, but reboot happens immediatelly. It is because panic() uses mdelay()
and mdelay() calls __const_udelay() immediately, which does not
work while booting.
The per_cpu cpu_info.loops_per_jiffy value is not initialized yet, so
__const_udelay() actually multiplies the number of loops by zero. This
results in __const_udelay() to delay the execution only by a nanosecond
or so.
So check whether cpu_info.loops_per_jiffy is zero and use
loops_per_jiffy in that case. mdelay() will not be so precise without
proper calibration, but it works relatively well.
Before:
[ 0.170039] delaying 100ms
[ 0.170828] done
After
[ 0.214042] delaying 100ms
[ 0.313974] done
I do not think the added check matters given we are about to spin the
processor in the next few hundred cycles.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170119114730.2670-1-jslaby@suse.cz
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 asm updates from Ingo Molnar:
"The main changes in this development cycle were:
- a large number of call stack dumping/printing improvements: higher
robustness, better cross-context dumping, improved output, etc.
(Josh Poimboeuf)
- vDSO getcpu() performance improvement for future Intel CPUs with
the RDPID instruction (Andy Lutomirski)
- add two new Intel AVX512 features and the CPUID support
infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
He Chen)
- more copy-user unification (Borislav Petkov)
- entry code assembly macro simplifications (Alexander Kuleshov)
- vDSO C/R support improvements (Dmitry Safonov)
- misc fixes and cleanups (Borislav Petkov, Paul Bolle)"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
scripts/decode_stacktrace.sh: Fix address line detection on x86
x86/boot/64: Use defines for page size
x86/dumpstack: Make stack name tags more comprehensible
selftests/x86: Add test_vdso to test getcpu()
x86/vdso: Use RDPID in preference to LSL when available
x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
x86/cpufeatures: Enable new AVX512 cpu features
x86/cpuid: Provide get_scattered_cpuid_leaf()
x86/cpuid: Cleanup cpuid_regs definitions
x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
x86/unwind: Ensure stack grows down
x86/vdso: Set vDSO pointer only after success
x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
x86/unwind: Detect bad stack return address
x86/dumpstack: Warn on stack recursion
x86/unwind: Warn on bad frame pointer
x86/decoder: Use stderr if insn sanity test fails
x86/decoder: Use stdout if insn decoder test is successful
mm/page_alloc: Remove kernel address exposure in free_reserved_area()
x86/dumpstack: Remove raw stack dump
...
Make the MSR argument an unsigned int, both low and high u32, put
"notrace" last in the function signature. Reflow function signatures for
better readability and cleanup white space.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We already have the same functionality in usercopy_32.c. Share it with
64-bit and get rid of some more asm glue which is not needed anymore.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161031151015.22087-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull kbuild updates from Michal Marek:
- EXPORT_SYMBOL for asm source by Al Viro.
This does bring a regression, because genksyms no longer generates
checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
working on a patch to fix this.
Plus, we are talking about functions like strcpy(), which rarely
change prototypes.
- Fixes for PPC fallout of the above by Stephen Rothwell and Nick
Piggin
- fixdep speedup by Alexey Dobriyan.
- preparatory work by Nick Piggin to allow architectures to build with
-ffunction-sections, -fdata-sections and --gc-sections
- CONFIG_THIN_ARCHIVES support by Stephen Rothwell
- fix for filenames with colons in the initramfs source by me.
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
initramfs: Escape colons in depfile
ppc: there is no clear_pages to export
powerpc/64: whitelist unresolved modversions CRCs
kbuild: -ffunction-sections fix for archs with conflicting sections
kbuild: add arch specific post-link Makefile
kbuild: allow archs to select link dead code/data elimination
kbuild: allow architectures to use thin archives instead of ld -r
kbuild: Regenerate genksyms lexer
kbuild: genksyms fix for typeof handling
fixdep: faster CONFIG_ search
ia64: move exports to definitions
sparc32: debride memcpy.S a bit
[sparc] unify 32bit and 64bit string.h
sparc: move exports to definitions
ppc: move exports to definitions
arm: move exports to definitions
s390: move exports to definitions
m68k: move exports to definitions
alpha: move exports to actual definitions
x86: move exports to actual definitions
...
Use the mcsafe_key defined in the previous patch to make decisions on which
copy function to use. We can't use the FEATURE bit any more because PCI
quirks run too late to affect the patching of code. So we use a static key.
Turn memcpy_mcsafe() into an inline function to make life easier for
callers. The assembly code that actually does the copy is now named
memcpy_mcsafe_unrolled()
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Boris Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/bfde2fc774e94f53d91b70a4321c85a0d33e7118.1472754712.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
debug_putstr() is used to output strings without using printf-like
formatting but debug_putstr(v) is defined as early_printk(v) in
arch/x86/lib/kaslr.c.
This makes clang reports the following warning when building
with -Wformat-security:
arch/x86/lib/kaslr.c:57:15: warning: format string is not a string
literal (potentially insecure) [-Wformat-security]
debug_putstr(purpose);
^~~~~~~
Fix this by using "%s" in early_printk().
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160806102039.27221-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 header cleanups from Ingo Molnar:
"This tree is a cleanup of the x86 tree reducing spurious uses of
module.h - which should improve build performance a bit"
* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
x86/apic: Remove duplicated include from probe_64.c
x86/ce4100: Remove duplicated include from ce4100.c
x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
x86/platform: Delete extraneous MODULE_* tags fromm ts5500
x86: Audit and remove any remaining unnecessary uses of module.h
x86/kvm: Audit and remove any unnecessary uses of module.h
x86/xen: Audit and remove any unnecessary uses of module.h
x86/platform: Audit and remove any unnecessary uses of module.h
x86/lib: Audit and remove any unnecessary uses of module.h
x86/kernel: Audit and remove any unnecessary uses of module.h
x86/mm: Audit and remove any unnecessary uses of module.h
x86: Don't use module.h just for AUTHOR / LICENSE tags
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
cN3edFVIdxvZeMgM5Ubq
=xCBG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
Pull perf fixes from Ingo Molnar:
"This tree contains tooling fixes plus some additions:
- fixes to the vdso2c build environment that Stephen Rothwell is
using for the linux-next build (Arnaldo Carvalho de Melo)
- AVX-512 instruction mappings (Adrian Hunter)
- misc fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "perf tools: event.h needs asm/perf_regs.h"
x86: Make the vdso2c compiler use the host architecture headers
tools build: Fix objtool build with ARCH=x86_64
objtool: Always use host headers
objtool: Use tools/scripts/Makefile.arch to get ARCH and HOSTARCH
tools build: Add HOSTARCH Makefile variable
perf tests kmod-path: Fix build on ubuntu:16.04-x-armhf
perf tools: Add AVX-512 instructions to the new instructions test
perf tools: Add AVX-512 support to the instruction decoder used by Intel PT
x86/insn: Add AVX-512 support to the instruction decoder
x86/insn: perf tools: Fix vcvtph2ps instruction decoding
Pull x86 boot updates from Ingo Molnar:
"The main changes:
- add initial commits to randomize kernel memory section virtual
addresses, enabled via a new kernel option: RANDOMIZE_MEMORY
(Thomas Garnier, Kees Cook, Baoquan He, Yinghai Lu)
- enhance KASLR (RANDOMIZE_BASE) physical memory randomization (Kees
Cook)
- EBDA/BIOS region boot quirk cleanups (Andy Lutomirski, Ingo Molnar)
- misc cleanups/fixes"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Simplify EBDA-vs-BIOS reservation logic
x86/boot: Clarify what x86_legacy_features.reserve_bios_regions does
x86/boot: Reorganize and clean up the BIOS area reservation code
x86/mm: Do not reference phys addr beyond kernel
x86/mm: Add memory hotplug support for KASLR memory randomization
x86/mm: Enable KASLR for vmalloc memory regions
x86/mm: Enable KASLR for physical mapping memory regions
x86/mm: Implement ASLR for kernel memory regions
x86/mm: Separate variable for trampoline PGD
x86/mm: Add PUD VA support for physical mapping
x86/mm: Update physical mapping variable names
x86/mm: Refactor KASLR entropy functions
x86/KASLR: Fix boot crash with certain memory configurations
x86/boot/64: Add forgotten end of function marker
x86/KASLR: Allow randomization below the load address
x86/KASLR: Extend kernel image physical address randomization to addresses larger than 4G
x86/KASLR: Randomize virtual address separately
x86/KASLR: Clarify identity map interface
x86/boot: Refuse to build with data relocations
x86/KASLR, x86/power: Remove x86 hibernation restrictions
The pcommit instruction is being deprecated in favor of either ADR
(asynchronous DRAM refresh: flush-on-power-fail) at the platform level, or
posted-write-queue flush addresses as defined by the ACPI 6.x NFIT (NVDIMM
Firmware Interface Table).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add support for Intel's AVX-512 instructions to the instruction decoder.
AVX-512 instructions are documented in Intel Architecture Instruction
Set Extensions Programming Reference (February 2016).
AVX-512 instructions are identified by a EVEX prefix which, for the
purpose of instruction decoding, can be treated as though it were a
4-byte VEX prefix.
Existing instructions which can now accept an EVEX prefix need not be
further annotated in the op code map (x86-opcode-map.txt). In the case
of new instructions, the op code map is updated accordingly.
Also add associated Mask Instructions that are used to manipulate mask
registers used in AVX-512 instructions.
The 'perf tools' instruction decoder is updated in a subsequent patch.
And a representative set of instructions is added to the perf tools new
instructions test in a subsequent patch.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: X86 ML <x86@kernel.org>
Link: http://lkml.kernel.org/r/1469003437-32706-3-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
vcvtph2ps does not have an immediate operand, so remove the erroneous
'Ib' from its opcode map entry. Add vcvtph2ps to the perf tools new
instructions test to verify it.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: X86 ML <x86@kernel.org>
Link: http://lkml.kernel.org/r/1469003437-32706-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
struct thread_info is a legacy mess. To prepare for its partial removal,
move thread_info::addr_limit out.
As an added benefit, this way is simpler.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/15bee834d09402b47ac86f2feccdf6529f9bc5b0.1468527351.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends. That changed
when we forked out support for the latter into the export.h file.
This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig. The advantage
in doing so is that module.h itself sources about 15 other headers;
adding significantly to what we feed cpp, and it can obscure what
headers we are effectively using.
Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each obj-y/bool instance
for the presence of either and replace as needed. Build testing
revealed a couple implicit header usage issues that were fixed.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160714001901.31603-5-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
People complained about ARCH_HWEIGHT_CFLAGS and how it throws a wrench
into kcov, lto, etc, experimentations.
Add asm versions for __sw_hweight{32,64}() and do explicit saving and
restoring of clobbered registers. This gets rid of the special calling
convention. We get to call those functions on !X86_FEATURE_POPCNT CPUs.
We still need to hardcode POPCNT and register operands as some old gas
versions which we support, do not know about POPCNT.
Btw, remove redundant REX prefix from 32-bit POPCNT because alternatives
can do padding now.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464605787-20603-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- MSR access API fixes and enhancements (Andy Lutomirski)
- early exception handling improvements (Andy Lutomirski)
- user-space FS/GS prctl usage fixes and improvements (Andy
Lutomirski)
- Remove the cpu_has_*() APIs and replace them with equivalents
(Borislav Petkov)
- task switch micro-optimization (Brian Gerst)
- 32-bit entry code simplification (Denys Vlasenko)
- enhance PAT handling in enumated CPUs (Toshi Kani)
... and lots of other cleanups/fixlets"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS
x86/entry/32: Remove asmlinkage_protect()
x86/entry/32: Remove GET_THREAD_INFO() from entry code
x86/entry, sched/x86: Don't save/restore EFLAGS on task switch
x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
selftests/x86/ldt_gdt: Test set_thread_area() deletion of an active segment
x86/tls: Synchronize segment registers in set_thread_area()
x86/asm/64: Rename thread_struct's fs and gs to fsbase and gsbase
x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization
x86/segments/64: When load_gs_index fails, clear the base
x86/segments/64: When loadsegment(fs, ...) fails, clear the base
x86/asm: Make asm/alternative.h safe from assembly
x86/asm: Stop depending on ptrace.h in alternative.h
x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
x86/asm: Make sure verify_cpu() has a good stack
x86/extable: Add a comment about early exception handlers
x86/msr: Set the return value to zero when native_rdmsr_safe() fails
x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
x86/paravirt: Add paravirt_{read,write}_msr()
x86/msr: Carry on after a non-"safe" MSR access fails
...
3387a535ce ("x86/asm: Create stack frames in rwsem functions") has
added FRAME_{BEGIN,END} annotations to rwsem asm stubs. The patch
which has added call_rwsem_down_write_failed_killable() was based on an
older tree so it didn't know about annotations. Let's add them.
This addresses the following objtool warning:
arch/x86/lib/rwsem.o: warning: objtool: call_rwsem_down_write_failed_killable()+0xe: call without frame pointer save/setup
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1460541432-21631-1-git-send-email-mhocko@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
which uses the same fast path as __down_write() except it falls back to
call_rwsem_down_write_failed_killable() slow path and return -EINTR if
killed. To prevent from code duplication extract the skeleton of
__down_write() into a helper macro which just takes the semaphore
and the slow path function to be called.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Signed-off-by: Jason Low <jason.low2@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1460041951-22347-11-git-send-email-mhocko@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull 'objtool' stack frame validation from Ingo Molnar:
"This tree adds a new kernel build-time object file validation feature
(ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
It was written by and is maintained by Josh Poimboeuf.
The motivation: there's a category of hard to find kernel bugs, most
of them in assembly code (but also occasionally in C code), that
degrades the quality of kernel stack dumps/backtraces. These bugs are
hard to detect at the source code level. Such bugs result in
incorrect/incomplete backtraces most of time - but can also in some
rare cases result in crashes or other undefined behavior.
The build time correctness checking is done via the new 'objtool'
user-space utility that was written for this purpose and which is
hosted in the kernel repository in tools/objtool/. The tool's (very
simple) UI and source code design is shaped after Git and perf and
shares quite a bit of infrastructure with tools/perf (which tooling
infrastructure sharing effort got merged via perf and is already
upstream). Objtool follows the well-known kernel coding style.
Objtool does not try to check .c or .S files, it instead analyzes the
resulting .o generated machine code from first principles: it decodes
the instruction stream and interprets it. (Right now objtool supports
the x86-64 architecture.)
From tools/objtool/Documentation/stack-validation.txt:
"The kernel CONFIG_STACK_VALIDATION option enables a host tool named
objtool which runs at compile time. It has a "check" subcommand
which analyzes every .o file and ensures the validity of its stack
metadata. It enforces a set of rules on asm code and C inline
assembly code so that stack traces can be reliable.
Currently it only checks frame pointer usage, but there are plans to
add CFI validation for C files and CFI generation for asm files.
For each function, it recursively follows all possible code paths
and validates the correct frame pointer state at each instruction.
It also follows code paths involving special sections, like
.altinstructions, __jump_table, and __ex_table, which can add
alternative execution paths to a given instruction (or set of
instructions). Similarly, it knows how to follow switch statements,
for which gcc sometimes uses jump tables."
When this new kernel option is enabled (it's disabled by default), the
tool, if it finds any suspicious assembly code pattern, outputs
warnings in compiler warning format:
warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer
... so that scripts that pick up compiler warnings will notice them.
All known warnings triggered by the tool are fixed by the tree, most
of the commits in fact prepare the kernel to be warning-free. Most of
them are bugfixes or cleanups that stand on their own, but there are
also some annotations of 'special' stack frames for justified cases
such entries to JIT-ed code (BPF) or really special boot time code.
There are two other long-term motivations behind this tool as well:
- To improve the quality and reliability of kernel stack frames, so
that they can be used for optimized live patching.
- To create independent infrastructure to check the correctness of
CFI stack frames at build time. CFI debuginfo is notoriously
unreliable and we cannot use it in the kernel as-is without extra
checking done both on the kernel side and on the build side.
The quality of kernel stack frames matters to debuggability as well,
so IMO we can merge this without having to consider the live patching
or CFI debuginfo angle"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
objtool: Only print one warning per function
objtool: Add several performance improvements
tools: Copy hashtable.h into tools directory
objtool: Fix false positive warnings for functions with multiple switch statements
objtool: Rename some variables and functions
objtool: Remove superflous INIT_LIST_HEAD
objtool: Add helper macros for traversing instructions
objtool: Fix false positive warnings related to sibling calls
objtool: Compile with debugging symbols
objtool: Detect infinite recursion
objtool: Prevent infinite recursion in noreturn detection
objtool: Detect and warn if libelf is missing and don't break the build
tools: Support relative directory path for 'O='
objtool: Support CROSS_COMPILE
x86/asm/decoder: Use explicitly signed chars
objtool: Enable stack metadata validation on 64-bit x86
objtool: Add CONFIG_STACK_VALIDATION option
objtool: Add tool to perform compile-time stack metadata validation
x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
sched: Always inline context_switch()
...
Pull networking updates from David Miller:
"Highlights:
1) Support more Realtek wireless chips, from Jes Sorenson.
2) New BPF types for per-cpu hash and arrap maps, from Alexei
Starovoitov.
3) Make several TCP sysctls per-namespace, from Nikolay Borisov.
4) Allow the use of SO_REUSEPORT in order to do per-thread processing
of incoming TCP/UDP connections. The muxing can be done using a
BPF program which hashes the incoming packet. From Craig Gallek.
5) Add a multiplexer for TCP streams, to provide a messaged based
interface. BPF programs can be used to determine the message
boundaries. From Tom Herbert.
6) Add 802.1AE MACSEC support, from Sabrina Dubroca.
7) Avoid factorial complexity when taking down an inetdev interface
with lots of configured addresses. We were doing things like
traversing the entire address less for each address removed, and
flushing the entire netfilter conntrack table for every address as
well.
8) Add and use SKB bulk free infrastructure, from Jesper Brouer.
9) Allow offloading u32 classifiers to hardware, and implement for
ixgbe, from John Fastabend.
10) Allow configuring IRQ coalescing parameters on a per-queue basis,
from Kan Liang.
11) Extend ethtool so that larger link mode masks can be supported.
From David Decotigny.
12) Introduce devlink, which can be used to configure port link types
(ethernet vs Infiniband, etc.), port splitting, and switch device
level attributes as a whole. From Jiri Pirko.
13) Hardware offload support for flower classifiers, from Amir Vadai.
14) Add "Local Checksum Offload". Basically, for a tunneled packet
the checksum of the outer header is 'constant' (because with the
checksum field filled into the inner protocol header, the payload
of the outer frame checksums to 'zero'), and we can take advantage
of that in various ways. From Edward Cree"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
bonding: fix bond_get_stats()
net: bcmgenet: fix dma api length mismatch
net/mlx4_core: Fix backward compatibility on VFs
phy: mdio-thunder: Fix some Kconfig typos
lan78xx: add ndo_get_stats64
lan78xx: handle statistics counter rollover
RDS: TCP: Remove unused constant
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
net: smc911x: convert pxa dma to dmaengine
team: remove duplicate set of flag IFF_MULTICAST
bonding: remove duplicate set of flag IFF_MULTICAST
net: fix a comment typo
ethernet: micrel: fix some error codes
ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
bpf, dst: add and use dst_tclassid helper
bpf: make skb->tc_classid also readable
net: mvneta: bm: clarify dependencies
cls_bpf: reset class and reuse major in da
ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
ldmvsw: Add ldmvsw.c driver code
...
Returning a 'bool' was very unpopular. Doubly so because the
code was just wrong (returning zero for true, one for false;
great for shell programming, not so good for C).
Change return type to "int". Keep zero as the success indicator
because it matches other similar code and people may be more
comfortable writing:
if (memcpy_mcsafe(to, from, count)) {
printk("Sad panda, copy failed\n");
...
}
Make the failure return value -EFAULT for now.
Reported by: Mika Penttilä <mika.penttila@nextfour.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mika.penttila@nextfour.com
Fixes: 92b0729c34 ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Link: http://lkml.kernel.org/r/695f14233fa7a54fcac4406c706d7fec228e3f4c.1457993040.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 boot updates from Ingo Molnar:
"Early command line options parsing enhancements from Dave Hansen, plus
minor cleanups and enhancements"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Remove unused 'is_big_kernel' variable
x86/boot: Use proper array element type in memset() size calculation
x86/boot: Pass in size to early cmdline parsing
x86/boot: Simplify early command line parsing
x86/boot: Fix early command-line parsing when partial word matches
x86/boot: Fix early command-line parsing when matching at end
x86/boot: Simplify kernel load address alignment check
x86/boot: Micro-optimize reset_early_page_tables()
Pull x86 asm updates from Ingo Molnar:
"This is another big update. Main changes are:
- lots of x86 system call (and other traps/exceptions) entry code
enhancements. In particular the complex parts of the 64-bit entry
code have been migrated to C code as well, and a number of dusty
corners have been refreshed. (Andy Lutomirski)
- vDSO special mapping robustification and general cleanups (Andy
Lutomirski)
- cpufeature refactoring, cleanups and speedups (Borislav Petkov)
- lots of other changes ..."
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
x86/cpufeature: Enable new AVX-512 features
x86/entry/traps: Show unhandled signal for i386 in do_trap()
x86/entry: Call enter_from_user_mode() with IRQs off
x86/entry/32: Change INT80 to be an interrupt gate
x86/entry: Improve system call entry comments
x86/entry: Remove TIF_SINGLESTEP entry work
x86/entry/32: Add and check a stack canary for the SYSENTER stack
x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
x86/entry: Only allocate space for tss_struct::SYSENTER_stack if needed
x86/entry: Vastly simplify SYSENTER TF (single-step) handling
x86/entry/traps: Clear DR6 early in do_debug() and improve the comment
x86/entry/traps: Clear TIF_BLOCKSTEP on all debug exceptions
x86/entry/32: Restore FLAGS on SYSEXIT
x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
selftests/x86: In syscall_nt, test NT|TF as well
x86/asm-offsets: Remove PARAVIRT_enabled
x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
uprobes: __create_xol_area() must nullify xol_mapping.fault
x86/cpufeature: Create a new synthetic cpu capability for machine check recovery
...
Pull RAS updates from Ingo Molnar:
"Various RAS updates:
- AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
MCA) error decoding support (Aravind Gopalakrishnan)
- x86 memcpy_mcsafe() support, to enable smart(er) hardware error
recovery in NVDIMM drivers, based on an extension of the x86
exception handling code. (Tony Luck)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/sb_edac: Fix computation of channel address
x86/mm, x86/mce: Add memcpy_mcsafe()
x86/mce/AMD: Document some functionality
x86/mce: Clarify comments regarding deferred error
x86/mce/AMD: Fix logic to obtain block address
x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
x86/mce: Move MCx_CONFIG MSR definitions
x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
x86/mm: Expand the exception table logic to allow new handling options
x86/mce/AMD: Set MCAX Enable bit
x86/mce/AMD: Carve out threshold block preparation
x86/mce/AMD: Fix LVT offset configuration for thresholding
x86/mce/AMD: Reduce number of blocks scanned per bank
x86/mce/AMD: Do not perform shared bank check for future processors
x86/mce: Fix order of AMD MCE init function call
This patch updates csum_ipv6_magic so that it correctly recognizes that
protocol is a unsigned 8 bit value.
This will allow us to better understand what limitations may or may not be
present in how we handle the data. For example there are a number of
places that call htonl on the protocol value. This is likely not necessary
and can be replaced with a multiplication by ntohl(1) which will be
converted to a shift by the compiler.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do use this_cpu_ptr(&cpu_tss) as a cacheline-aligned, seldomly
accessed per-cpu var as the MONITORX target in delay_mwaitx(). However,
when called in preemptible context, this_cpu_ptr -> smp_processor_id() ->
debug_smp_processor_id() fires:
BUG: using smp_processor_id() in preemptible [00000000] code: udevd/312
caller is delay_mwaitx+0x40/0xa0
But we don't care about that check - we only need cpu_tss as a MONITORX
target and it doesn't really matter which CPU's var we're touching as
we're going idle anyway. Fix that.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/20160309205622.GG6564@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make use of the EXTABLE_FAULT exception table entries to write
a kernel copy routine that doesn't crash the system if it
encounters a machine check. Prime use case for this is to copy
from large arrays of non-volatile memory used as storage.
We have to use an unrolled copy loop for now because current
hardware implementations treat a machine check in "rep mov"
as fatal. When that is fixed we can simplify.
Return type is a "bool". True means that we copied OK, false means
that it didn't.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@gmail.com>
Link: http://lkml.kernel.org/r/a44e1055efc2d2a9473307b22c91caa437aa3f8b.1456439214.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When running objtool on a ppc64le host to analyze x86 binaries, it
reports a lot of false warnings like:
ipc/compat_mq.o: warning: objtool: compat_SyS_mq_open()+0x91: can't find jump dest instruction at .text+0x3a5
The warnings are caused by the x86 instruction decoder setting the wrong
value for the jump instruction's immediate field because it assumes that
"char == signed char", which isn't true for all architectures. When
converting char to int, gcc sign-extends on x86 but doesn't sign-extend
on ppc64le.
According to the gcc man page, that's a feature, not a bug:
> Each kind of machine has a default for what "char" should be. It is
> either like "unsigned char" by default or like "signed char" by
> default.
>
> Ideally, a portable program should always use "signed char" or
> "unsigned char" when it depends on the signedness of an object.
Conform to the "standards" by changing the "char" casts to "signed
char". This results in no actual changes to the object code on x86.
Note: the x86 decoder now lives in three different locations in the
kernel tree, which are all kept in sync via makefile checks and
warnings: in-kernel, perf, and objtool. This fixes all three locations.
Eventually we should probably try to at least converge the two separate
"tools" locations into a single shared location.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/9dd4161719b20e6def9564646d68bfbe498c549f.1456962210.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rwsem.S has several callable non-leaf functions which don't honor
CONFIG_FRAME_POINTER, which can result in bad stack traces.
Create stack frames for them when CONFIG_FRAME_POINTER is enabled.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/ad0932bbead975b15f9578e4f2cf2ee5961eb840.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Data corruption issues were observed in tests which initiated
a system crash/reset while accessing BTT devices. This problem
is reproducible.
The BTT driver calls pmem_rw_bytes() to update data in pmem
devices. This interface calls __copy_user_nocache(), which
uses non-temporal stores so that the stores to pmem are
persistent.
__copy_user_nocache() uses non-temporal stores when a request
size is 8 bytes or larger (and is aligned by 8 bytes). The
BTT driver updates the BTT map table, which entry size is
4 bytes. Therefore, updates to the map table entries remain
cached, and are not written to pmem after a crash.
Change __copy_user_nocache() to use non-temporal store when
a request size is 4 bytes. The change extends the current
byte-copy path for a less-than-8-bytes request, and does not
add any overhead to the regular path.
Reported-and-tested-by: Micah Parrish <micah.parrish@hpe.com>
Reported-and-tested-by: Brian Boylston <brian.boylston@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455225857-12039-3-git-send-email-toshi.kani@hpe.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We will use this in a few patches to implement tests for early parsing.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Aligned args properly. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225243.5CC47EB6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__cmdline_find_option_bool() tries to account for both NULL-terminated
and non-NULL-terminated strings. It keeps 'pos' to look for the end of
the buffer and also looks for '!c' in a bunch of places to look for NULL
termination.
But, it also calls strlen(). You can't call strlen on a
non-NULL-terminated string.
If !strlen(cmdline), then cmdline[0]=='\0'. In that case, we will go in
to the while() loop, set c='\0', hit st_wordstart, notice !c, and will
immediately return 0.
So, remove the strlen(). It is unnecessary and unsafe.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225241.15365E43@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cmdline_find_option_bool() keeps track of position in two strings:
1. the command-line
2. the option we are searchign for in the command-line
We plow through each character in the command-line one at a time, always
moving forward. We move forward in the option ('opptr') when we match
characters in 'cmdline'. We reset the 'opptr' only when we go in to the
'st_wordstart' state.
But, if we fail to match an option because we see a space
(state=st_wordcmp, *opptr='\0',c=' '), we set state='st_wordskip' and
'break', moving to the next character. But, that move to the next
character is the one *after* the ' '. This means that we will miss a
'st_wordstart' state.
For instance, if we have
cmdline = "foo fool";
and are searching for "fool", we have:
"fool"
opptr = ----^
"foo fool"
c = --------^
We see that 'l' != ' ', set state=st_wordskip, break, and then move 'c', so:
"foo fool"
c = ---------^
and are still in state=st_wordskip. We will stay in wordskip until we
have skipped "fool", thus missing the option we were looking for. This
*only* happens when you have a partially- matching word followed by a
matching one.
To fix this, we always fall *into* the 'st_wordskip' state when we set
it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225239.8E1DCA58@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x86 early command line parsing in cmdline_find_option_bool() is
buggy. If it matches a specified 'option' all the way to the end of the
command-line, it will consider it a match.
For instance,
cmdline = "foo";
cmdline_find_option_bool(cmdline, "fool");
will return 1. This is particularly annoying since we have actual FPU
options like "noxsave" and "noxsaves" So, command-line "foo bar noxsave"
will match *BOTH* a "noxsave" and "noxsaves". (This turns out not to be
an actual problem because "noxsave" implies "noxsaves", but it's still
confusing.)
To fix this, we simplify the code and stop tracking 'len'. 'len'
was trying to indicate either the NULL terminator *OR* the end of a
non-NULL-terminated command line at 'COMMAND_LINE_SIZE'. But, each of the
three states is *already* checking 'cmdline' for a NULL terminator.
We _only_ need to check if we have overrun 'COMMAND_LINE_SIZE', and that
we can do without keeping 'len' around.
Also add some commends to clarify what is going on.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225238.9AEB560C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move them to a separate header and have the following
dependency:
x86/cpufeatures.h <- x86/processor.h <- x86/cpufeature.h
This makes it easier to use the header in asm code and not
include the whole cpufeature.h and add guards for asm.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 cpu updates from Ingo Molnar:
"The main changes in this cycle were:
- Improved CPU ID handling code and related enhancements (Borislav
Petkov)
- RDRAND fix (Len Brown)"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Replace RDRAND forced-reseed with simple sanity check
x86/MSR: Chop off lower 32-bit value
x86/cpu: Fix MSR value truncation issue
x86/cpu/amd, kvm: Satisfy guest kernel reads of IC_CFG MSR
kvm: Add accessors for guest CPU's family, model, stepping
x86/cpu: Unify CPU family, model, stepping calculation
For debugging low level code interacting with the CPU it is often
useful to trace the MSR read/writes. This gives a concise summary of
PMU and other operations.
perf has an ad-hoc way to do this using trace_printk, but it's
somewhat limited (and also now spews ugly boot messages when enabled)
Instead define real trace points for all MSR accesses.
This adds three new trace points: read_msr and write_msr and rdpmc.
They also report if the access faulted (if *_safe is used)
This allows filtering and triggering on specific MSR values, which
allows various more advanced debugging techniques.
All the values are well defined in the CPU documentation.
The trace can be post processed with
Documentation/trace/postprocess/decode_msr.py to add symbolic MSR
names to the trace.
I only added it to native MSR accesses in C, not paravirtualized or in
entry*.S (which is not too interesting)
Originally the patch kit moved the MSRs out of line. This uses an
alternative approach recommended by Steven Rostedt of only moving the
trace calls out of line, but open coding the access to the jump label.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1449018060-1742-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add generic functions which calc family, model and stepping from
the CPUID_1.EAX leaf and stick them into the library we have.
Rename those which do call CPUID with the prefix "x86_cpuid" as
suggested by Paolo Bonzini.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448273546-2567-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add xsavec, xsaves and xrstors to the op code map and the perf tools new
instructions test. To run the test:
$ tools/perf/perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ tools/perf/perf test -v "x86 ins" 2>&1 | grep 'xsave\|xrst'
For information about xsavec, xsaves and xrstors, refer the Intel SDM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441196131-20632-8-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add rdpkru and wrpkru to the op code map and the perf tools new
instructions test. In the case of the test, only the bytes can be
tested at the moment since binutils doesn't support the instructions
yet. To run the test:
$ tools/perf/perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ tools/perf/perf test -v "x86 ins" 2>&1 | grep pkru
For information about rdpkru and wrpkru, refer the Intel SDM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441196131-20632-7-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Intel Architecture Instruction Set Extensions Programing Reference (Oct
2014) describes 3 new memory instructions, namely clflushopt, clwb and
pcommit. Add them to the op code map and the perf tools new
instructions test. e.g.
$ tools/perf/perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ tools/perf/perf test -v "x86 ins"
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441196131-20632-6-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Intel SHA Extensions are explained in the Intel Architecture
Instruction Set Extensions Programing Reference (Oct 2014).
There are 7 new instructions. Add them to the op code map
and the perf tools new instructions test. e.g.
$ tools/perf/perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ tools/perf/perf test -v "x86 ins" 2>&1 | grep sha
Committer note:
3 lines of details, for the curious:
$ perf test -v "x86 ins" 2>&1 | grep sha256msg1 | tail -3
Decoded ok: 0f 38 cc 84 08 78 56 34 12 sha256msg1 0x12345678(%rax,%rcx,1),%xmm0
Decoded ok: 0f 38 cc 84 c8 78 56 34 12 sha256msg1 0x12345678(%rax,%rcx,8),%xmm0
Decoded ok: 44 0f 38 cc bc c8 78 56 34 12 sha256msg1 0x12345678(%rax,%rcx,8),%xmm15
$
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441196131-20632-5-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The MPX instructions are presently not described in the SDM
opcode maps, and there are not encoding characters for bnd
registers, address method or operand type. So the kernel
opcode map is using 'Gv' for bnd registers and 'Ev' for
everything else. That is fine because the instruction
decoder does not use that information anyway, except as
an indication that there is a ModR/M byte.
Nevertheless, in some cases the 'Gv' and 'Ev' are the wrong
way around, BNDLDX and BNDSTX have 2 operands not 3, and it
wouldn't hurt to identify the mandatory prefixes.
This has no effect on the decoding of valid instructions,
but the addition of the mandatory prefixes will cause some
invalid instructions to error out that wouldn't have
previously.
Note that perf tools has a copy of the instruction decoder
and provides a test for new instructions which includes MPX
instructions e.g.
$ perf test "x86 ins"
39: Test x86 instruction decoder - new instructions : Ok
Or to see the details:
$ perf test -v "x86 ins"
Commiter notes:
And to see these MPX instructions specifically:
$ perf test -v "x86 ins" 2>&1 | grep bndldx | head -3
Decoded ok: 0f 1a 00 bndldx (%eax),%bnd0
Decoded ok: 0f 1a 05 78 56 34 12 bndldx 0x12345678,%bnd0
Decoded ok: 0f 1a 18 bndldx (%eax),%bnd3
$ perf test -v "x86 ins" 2>&1 | grep bndstx | head -3
Decoded ok: 0f 1b 00 bndstx %bnd0,(%eax)
Decoded ok: 0f 1b 05 78 56 34 12 bndstx %bnd0,0x12345678
Decoded ok: 0f 1b 18 bndstx %bnd3,(%eax)
$
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441196131-20632-4-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
MWAITX can enable a timer and a corresponding timer value
specified in SW P0 clocks. The SW P0 frequency is the same as
TSC. The timer provides an upper bound on how long the
instruction waits before exiting.
This way, a delay function in the kernel can leverage that
MWAITX timer of MWAITX.
When a CPU core executes MWAITX, it will be quiesced in a
waiting phase, diminishing its power consumption. This way, we
can save power in comparison to our default TSC-based delays.
A simple test shows that:
$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
$ sleep 10000s
$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
Results:
* TSC-based default delay: 485115 uWatts average power
* MWAITX-based delay: 252738 uWatts average power
Thus, that's about 240 milliWatts less power consumption. The
test method relies on the support of AMD CPU accumulated power
algorithm in fam15h_power for which patches are forthcoming.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Suggested-by: Borislav Petkov <bp@suse.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang Rui <ray.huang@amd.com>
[ Fix delay truncation. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andreas Herrmann <herrmann.der.user@gmail.com>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Li <tony.li@amd.com>
Link: http://lkml.kernel.org/r/1438744732-1459-3-git-send-email-ray.huang@amd.com
Link: http://lkml.kernel.org/r/1439201994-28067-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rdtsc_barrier(); rdtsc() is an unnecessary mouthful and requires
more thought than should be necessary. Add an rdtsc_ordered()
helper and replace the trivial call sites with it.
This should not change generated code. The duplication of the
fence asm is temporary.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/dddbf98a2af53312e9aa73a5a2b1622fe5d6f52b.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there is no paravirt TSC, the "native" is
inappropriate. The function does RDTSC, so give it the obvious
name: rdtsc().
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/fd43e16281991f096c1e4d21574d9e1402c62d39.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As a very minor optimization, delay_tsc() was only using the low
32 bits of the TSC. It's a delay function, so just use the whole
thing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/bd1a277c71321b67c4794970cb5ace05efe21ab6.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the ->read_tsc() paravirt hook is gone, rdtscll() is
just a wrapper around native_read_tsc(). Unwrap it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/d2449ae62c1b1fb90195bcfb19ef4a35883a04dc.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 0a196848ca ("perf: Fix arch_perf_out_copy_user default"),
changes copy_from_user_nmi() to return the number of
remaining bytes so that it behave like copy_from_user().
Unfortunately, when the range is outside of the process
memory, the return value is still the number of byte
copied, eg. 0, instead of the remaining bytes.
As all users of copy_from_user_nmi() were modified as
part of commit 0a196848ca, the function should be
fixed to return the total number of bytes if range is
not correct.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435001923-30986-1-git-send-email-ydroneaud@opteya.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 core updates from Ingo Molnar:
"There were so many changes in the x86/asm, x86/apic and x86/mm topics
in this cycle that the topical separation of -tip broke down somewhat -
so the result is a more traditional architecture pull request,
collected into the 'x86/core' topic.
The topics were still maintained separately as far as possible, so
bisectability and conceptual separation should still be pretty good -
but there were a handful of merge points to avoid excessive
dependencies (and conflicts) that would have been poorly tested in the
end.
The next cycle will hopefully be much more quiet (or at least will
have fewer dependencies).
The main changes in this cycle were:
* x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
Gleixner)
- This is the second and most intrusive part of changes to the x86
interrupt handling - full conversion to hierarchical interrupt
domains:
[IOAPIC domain] -----
|
[MSI domain] --------[Remapping domain] ----- [ Vector domain ]
| (optional) |
[HPET MSI domain] ----- |
|
[DMAR domain] -----------------------------
|
[Legacy domain] -----------------------------
This now reflects the actual hardware and allowed us to distangle
the domain specific code from the underlying parent domain, which
can be optional in the case of interrupt remapping. It's a clear
separation of functionality and removes quite some duct tape
constructs which plugged the remap code between ioapic/msi/hpet
and the vector management.
- Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
injection into guests (Feng Wu)
* x86/asm changes:
- Tons of cleanups and small speedups, micro-optimizations. This
is in preparation to move a good chunk of the low level entry
code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
Brian Gerst)
- Moved all system entry related code to a new home under
arch/x86/entry/ (Ingo Molnar)
- Removal of the fragile and ugly CFI dwarf debuginfo annotations.
Conversion to C will reintroduce many of them - but meanwhile
they are only getting in the way, and the upstream kernel does
not rely on them (Ingo Molnar)
- NOP handling refinements. (Borislav Petkov)
* x86/mm changes:
- Big PAT and MTRR rework: making the code more robust and
preparing to phase out exposing direct MTRR interfaces to drivers -
in favor of using PAT driven interfaces (Toshi Kani, Luis R
Rodriguez, Borislav Petkov)
- New ioremap_wt()/set_memory_wt() interfaces to support
Write-Through cached memory mappings. This is especially
important for good performance on NVDIMM hardware (Toshi Kani)
* x86/ras changes:
- Add support for deferred errors on AMD (Aravind Gopalakrishnan)
This is an important RAS feature which adds hardware support for
poisoned data. That means roughly that the hardware marks data
which it has detected as corrupted but wasn't able to correct, as
poisoned data and raises an APIC interrupt to signal that in the
form of a deferred error. It is the OS's responsibility then to
take proper recovery action and thus prolonge system lifetime as
far as possible.
- Add support for Intel "Local MCE"s: upcoming CPUs will support
CPU-local MCE interrupts, as opposed to the traditional system-
wide broadcasted MCE interrupts (Ashok Raj)
- Misc cleanups (Borislav Petkov)
* x86/platform changes:
- Intel Atom SoC updates
... and lots of other cleanups, fixlets and other changes - see the
shortlog and the Git log for details"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
x86/hpet: Use proper hpet device number for MSI allocation
x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
genirq: Prevent crash in irq_move_irq()
genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
iommu, x86: Properly handle posted interrupts for IOMMU hotplug
iommu, x86: Provide irq_remapping_cap() interface
iommu, x86: Setup Posted-Interrupts capability for Intel iommu
iommu, x86: Add cap_pi_support() to detect VT-d PI capability
iommu, x86: Avoid migrating VT-d posted interrupts
iommu, x86: Save the mode (posted or remapped) of an IRTE
iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
iommu: dmar: Provide helper to copy shared irte fields
iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
iommu: Add new member capability to struct irq_remap_ops
x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
...
Pull x86 FPU updates from Ingo Molnar:
"This tree contains two main changes:
- The big FPU code rewrite: wide reaching cleanups and reorganization
that pulls all the FPU code together into a clean base in
arch/x86/fpu/.
The resulting code is leaner and faster, and much easier to
understand. This enables future work to further simplify the FPU
code (such as removing lazy FPU restores).
By its nature these changes have a substantial regression risk: FPU
code related bugs are long lived, because races are often subtle
and bugs mask as user-space failures that are difficult to track
back to kernel side backs. I'm aware of no unfixed (or even
suspected) FPU related regression so far.
- MPX support rework/fixes. As this is still not a released CPU
feature, there were some buglets in the code - should be much more
robust now (Dave Hansen)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (250 commits)
x86/fpu: Fix double-increment in setup_xstate_features()
x86/mpx: Allow 32-bit binaries on 64-bit kernels again
x86/mpx: Do not count MPX VMAs as neighbors when unmapping
x86/mpx: Rewrite the unmap code
x86/mpx: Support 32-bit binaries on 64-bit kernels
x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps
x86/mpx: Introduce new 'directory entry' to 'addr' helper function
x86/mpx: Add temporary variable to reduce masking
x86: Make is_64bit_mm() widely available
x86/mpx: Trace allocation of new bounds tables
x86/mpx: Trace the attempts to find bounds tables
x86/mpx: Trace entry to bounds exception paths
x86/mpx: Trace #BR exceptions
x86/mpx: Introduce a boot-time disable flag
x86/mpx: Restrict the mmap() size check to bounds tables
x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK
x86/mpx: Clean up the code by not passing a task pointer around when unnecessary
x86/mpx: Use the new get_xsave_field_ptr()API
x86/fpu/xstate: Wrap get_xsave_addr() to make it safer
x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions
...
preempt_schedule_context() is a tracing safe preemption point but it's
only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing
recursion issues since commit:
b30f0e3ffe ("sched/preempt: Optimize preemption operations on __schedule() callers")
introduced function based preemp_count_*() ops.
Lets make it available on all configs and give it a more appropriate
name for its new position.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433432349-1021-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These are all calling x86 entry code functions, so move them close
to other entry code.
Change lib-y to obj-y: there's no real difference between the two
as we don't really drop any of them during the linking stage, and
obj-y is the more common approach for core kernel object code.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the dwarf2 annotations in low level assembly code have
become an increasing hindrance: unreadable, messy macros
mixed into some of the most security sensitive code paths
of the Linux kernel.
These debug info annotations don't even buy the upstream
kernel anything: dwarf driven stack unwinding has caused
problems in the past so it's out of tree, and the upstream
kernel only uses the much more robust framepointers based
stack unwinding method.
In addition to that there's a steady, slow bitrot going
on with these annotations, requiring frequent fixups.
There's no tooling and no functionality upstream that
keeps it correct.
So burn down the sick forest, allowing new, healthier growth:
27 files changed, 350 insertions(+), 1101 deletions(-)
Someone who has the willingness and time to do this
properly can attempt to reintroduce dwarf debuginfo in x86
assembly code plus dwarf unwinding from first principles,
with the following conditions:
- it should be maximally readable, and maximally low-key to
'ordinary' code reading and maintenance.
- find a build time method to insert dwarf annotations
automatically in the most common cases, for pop/push
instructions that manipulate the stack pointer. This could
be done for example via a preprocessing step that just
looks for common patterns - plus special annotations for
the few cases where we want to depart from the default.
We have hundreds of CFI annotations, so automating most of
that makes sense.
- it should come with build tooling checks that ensure that
CFI annotations are sensible. We've seen such efforts from
the framepointer side, and there's no reason it couldn't be
done on the dwarf side.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have fpu/types.h, move i387.h to fpu/api.h.
The file name has become a misnomer anyway: it offers generic FPU APIs,
but is not limited to i387 functionality.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull it up into the header and kill duplicate versions.
Separately, both macros are identical:
35948b2bd3431aee7149e85cfe4becbc /tmp/a
35948b2bd3431aee7149e85cfe4becbc /tmp/b
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431538944-27724-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit cae2a173fe ("x86: clean up/fix 'copy_in_user()' tail zeroing")
fixed the failure case tail zeroing of one special case of the x86-64
generic user-copy routine, namely when used for the user-to-user case
("copy_in_user()").
But in the process it broke an even more unusual case: using the user
copy routine for kernel-to-kernel copying.
Now, normally kernel-kernel copies are obviously done using memcpy(),
but we have a couple of special cases when we use the user-copy
functions. One is when we pass a kernel buffer to a regular user-buffer
routine, using set_fs(KERNEL_DS). That's a "normal" case, and continued
to work fine, because it never takes any faults (with the possible
exception of a silent and successful vmalloc fault).
But Jan Beulich pointed out another, very unusual, special case: when we
use the user-copy routines not because it's a path that expects a user
pointer, but for a couple of ftrace/kgdb cases that want to do a kernel
copy, but do so using "unsafe" buffers, and use the user-copy routine to
gracefully handle faults. IOW, for probe_kernel_write().
And that broke for the case of a faulting kernel destination, because we
saw the kernel destination and wanted to try to clear the tail of the
buffer. Which doesn't work, since that's what faults.
This only triggers for things like kgdb and ftrace users (eg trying
setting a breakpoint on read-only memory), but it's definitely a bug.
The fix is to not compare against the kernel address start (TASK_SIZE),
but instead use the same limits "access_ok()" uses.
Reported-and-tested-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@vger.kernel.org # 4.0
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 asm changes from Ingo Molnar:
"There were lots of changes in this development cycle:
- over 100 separate cleanups, restructuring changes, speedups and
fixes in the x86 system call, irq, trap and other entry code, part
of a heroic effort to deobfuscate a decade old spaghetti asm code
and its C code dependencies (Denys Vlasenko, Andy Lutomirski)
- alternatives code fixes and enhancements (Borislav Petkov)
- simplifications and cleanups to the compat code (Brian Gerst)
- signal handling fixes and new x86 testcases (Andy Lutomirski)
- various other fixes and cleanups
By their nature many of these changes are risky - we tried to test
them well on many different x86 systems (there are no known
regressions), and they are split up finely to help bisection - but
there's still a fair bit of residual risk left so caveat emptor"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (148 commits)
perf/x86/64: Report regs_user->ax too in get_regs_user()
perf/x86/64: Simplify regs_user->abi setting code in get_regs_user()
perf/x86/64: Do report user_regs->cx while we are in syscall, in get_regs_user()
perf/x86/64: Do not guess user_regs->cs, ss, sp in get_regs_user()
x86/asm/entry/32: Tidy up JNZ instructions after TESTs
x86/asm/entry/64: Reduce padding in execve stubs
x86/asm/entry/64: Remove GET_THREAD_INFO() in ret_from_fork
x86/asm/entry/64: Simplify jumps in ret_from_fork
x86/asm/entry/64: Remove a redundant jump
x86/asm/entry/64: Optimize [v]fork/clone stubs
x86/asm/entry: Zero EXTRA_REGS for stub32_execve() too
x86/asm/entry/64: Move stub_x32_execvecloser() to stub_execveat()
x86/asm/entry/64: Use common code for rt_sigreturn() epilogue
x86/asm/entry/64: Add forgotten CFI annotation
x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layout
x86/asm/entry/64: Move opportunistic sysret code to syscall code path
x86, selftests: Add sigreturn selftest
x86/alternatives: Guard NOPs optimization
x86/asm/entry: Clear EXTRA_REGS for all executable formats
x86/signal: Remove pax argument from restore_sigcontext
...
The rule for 'copy_from_user()' is that it zeroes the remaining kernel
buffer even when the copy fails halfway, just to make sure that we don't
leave uninitialized kernel memory around. Because even if we check for
errors, some kernel buffers stay around after thge copy (think page
cache).
However, the x86-64 logic for user copies uses a copy_user_generic()
function for all the cases, that set the "zerorest" flag for any fault
on the source buffer. Which meant that it didn't just try to clear the
kernel buffer after a failure in copy_from_user(), it also tried to
clear the destination user buffer for the "copy_in_user()" case.
Not only is that pointless, it also means that the clearing code has to
worry about the tail clearing taking page faults for the user buffer
case. Which is just stupid, since that case shouldn't happen in the
first place.
Get rid of the whole "zerorest" thing entirely, and instead just check
if the destination is in kernel space or not. And then just use
memset() to clear the tail of the kernel buffer if necessary.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By the nature of the TEST operation, it is often possible to test
a narrower part of the operand:
"testl $3, mem" -> "testb $3, mem",
"testq $3, %rcx" -> "testb $3, %cl"
This results in shorter instructions, because the TEST instruction
has no sign-entending byte-immediate forms unlike other ALU ops.
Note that this change does not create any LCP (Length-Changing Prefix)
stalls, which happen when adding a 0x66 prefix, which happens when
16-bit immediates are used, which changes such TEST instructions:
[test_opcode] [modrm] [imm32]
to:
[0x66] [test_opcode] [modrm] [imm16]
where [imm16] has a *different length* now: 2 bytes instead of 4.
This confuses the decoder and slows down execution.
REX prefixes were carefully designed to almost never hit this case:
adding REX prefix does not change instruction length except MOVABS
and MOV [addr],RAX instruction.
This patch does not add instructions which would use a 0x66 prefix,
code changes in assembly are:
-48 f7 07 01 00 00 00 testq $0x1,(%rdi)
+f6 07 01 testb $0x1,(%rdi)
-48 f7 c1 01 00 00 00 test $0x1,%rcx
+f6 c1 01 test $0x1,%cl
-48 f7 c1 02 00 00 00 test $0x2,%rcx
+f6 c1 02 test $0x2,%cl
-41 f7 c2 01 00 00 00 test $0x1,%r10d
+41 f6 c2 01 test $0x1,%r10b
-48 f7 c1 04 00 00 00 test $0x4,%rcx
+f6 c1 04 test $0x4,%cl
-48 f7 c1 08 00 00 00 test $0x8,%rcx
+f6 c1 08 test $0x8,%cl
Linus further notes:
"There are no stalls from using 8-bit instruction forms.
Now, changing from 64-bit or 32-bit 'test' instructions to 8-bit ones
*could* cause problems if it ends up having forwarding issues, so that
instead of just forwarding the result, you end up having to wait for
it to be stable in the L1 cache (or possibly the register file). The
forwarding from the store buffer is simplest and most reliable if the
read is done at the exact same address and the exact same size as the
write that gets forwarded.
But that's true only if:
(a) the write was very recent and is still in the write queue. I'm
not sure that's the case here anyway.
(b) on at least most Intel microarchitectures, you have to test a
different byte than the lowest one (so forwarding a 64-bit write
to a 8-bit read ends up working fine, as long as the 8-bit read
is of the low 8 bits of the written data).
A very similar issue *might* show up for registers too, not just
memory writes, if you use 'testb' with a high-byte register (where
instead of forwarding the value from the original producer it needs to
go through the register file and then shifted). But it's mainly a
problem for store buffers.
But afaik, the way Denys changed the test instructions, neither of the
above issues should be true.
The real problem for store buffer forwarding tends to be "write 8
bits, read 32 bits". That can be really surprisingly expensive,
because the read ends up having to wait until the write has hit the
cacheline, and we might talk tens of cycles of latency here. But
"write 32 bits, read the low 8 bits" *should* be fast on pretty much
all x86 chips, afaik."
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425675332-31576-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparatory patch for change in "struct pt_regs"
handling in entry_64.S.
trace_hardirqs*() thunks were (ab)using a part of the
'pt_regs' handling code, namely the SAVE_ARGS/RESTORE_ARGS
macros, to save/restore registers across C function calls.
Since SAVE_ARGS is going to be changed, open-code
register saving/restoring here.
Incidentally, this removes a bit of dead code:
one SAVE_ARGS was used just to emit a CFI annotation,
but it also generated unreachable assembly instructions.
Take a page from thunk_32.S and use push/pop instructions
instead of movq, they are far shorter:
1 or 2 bytes versus 5, and no need for instructions to adjust %rsp:
text data bss dec hex filename
333 40 0 373 175 thunk_64_movq.o
104 40 0 144 90 thunk_64_push_pop.o
[ This is ugly as sin, but we'll fix up the ugliness in the next
patch. I see no point in reordering patches just to avoid an
ugly intermediate state. --Andy ]
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1420927210-19738-4-git-send-email-dvlasenk@redhat.com
Link: http://lkml.kernel.org/r/4c979ad604f0f02c5ade3b3da308b53eabd5e198.1424989793.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pad instructions and thus make using the alternatives macros more
straightforward and without having to figure out old and new instruction
sizes but have the toolchain figure that out for us.
Furthermore, it optimizes JMPs used so that fetch and decode can be
relieved with smaller versions of the JMPs, where possible.
Some stats:
x86_64 defconfig:
Alternatives sites total: 2478
Total padding added (in Bytes): 6051
The padding is currently done for:
X86_FEATURE_ALWAYS
X86_FEATURE_ERMS
X86_FEATURE_LFENCE_RDTSC
X86_FEATURE_MFENCE_RDTSC
X86_FEATURE_SMAP
This is with the latest version of the patchset. Of course, on each
machine the alternatives sites actually being patched are a proper
subset of the total number.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU9ekpAAoJEBLB8Bhh3lVKyjYP/AiHEiHkkjnpwTt49kUtUMI6
GIlGfJVNjp5LLnSRD/fkL/wdkBgQtMzr9O1g8Qi/lbFqxsOFteU9f1OtLx34ZwZw
MhtdiHcrKGMsaIxTJh4FaqPHBT5ussm2yn1jlAX+LgILd3dpqe3oytsO8JihcK9j
t2u9V/Lq92TV7zXxGgWJsPc86WhhgdldlU3X96S++Di18bnDaKbGkzthU6WzZG/H
qtFZ5bfK8TlVHYduft+D9ZPzFYGp1WCOa03qU4+Djaxw02HDB6Ltysend9zg0lB1
RT/BP0PwHD3mOL11qpgtV1ChCbR8FJMN/z5+YdSNJgzDQA0H5Sf0UueTweosfAz+
/iC5t/wkegdYtqtA0nKVypYOJCS+UdfMZXenYgtSUJl6drB6I5BCW4mVft3AuWo+
EilPGpblvmjWRx1HiF4/Q/5zrSWHzmKQDyXuyxI9m0OUxAGAM0+8CY6wOqRA5pX+
/f5MjZ1hXELQGhl5Qdj4nqJacICGevJ8WYdZ53B+uYVxz7fbXk9hSYcZKT94UshD
qSdaV4XJSuC7pDKqiWoNWXp5N1g+D2BgfwoQEr/RnodFZRlfc+cmOv/visak0OLr
E/pp1vJvCi3+T3ImX1MCDiXmflQtFctiL3hNgMXYK2IGhJb2RDC2bFeZkksOHuAE
BGgrn+usQDjVlikEnfI3
=0KXp
-----END PGP SIGNATURE-----
Merge tag 'alternatives_padding' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/asm
Pull alternative instructions framework improvements from Borislav Petkov:
"A more involved rework of the alternatives framework to be able to
pad instructions and thus make using the alternatives macros more
straightforward and without having to figure out old and new instruction
sizes but have the toolchain figure that out for us.
Furthermore, it optimizes JMPs used so that fetch and decode can be
relieved with smaller versions of the JMPs, where possible.
Some stats:
x86_64 defconfig:
Alternatives sites total: 2478
Total padding added (in Bytes): 6051
The padding is currently done for:
X86_FEATURE_ALWAYS
X86_FEATURE_ERMS
X86_FEATURE_LFENCE_RDTSC
X86_FEATURE_MFENCE_RDTSC
X86_FEATURE_SMAP
This is with the latest version of the patchset. Of course, on each
machine the alternatives sites actually being patched are a proper
subset of the total number."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it execute the ERMS version if support is present and we're in the
forward memmove() part and remove the unfolded alternatives section
definition.
Signed-off-by: Borislav Petkov <bp@suse.de>
Make alternatives replace single JMPs instead of whole memset functions,
thus decreasing the amount of instructions copied during patching time
at boot.
While at it, make it use the REP_GOOD version by default which means
alternatives NOP out the JMP to the other versions, as REP_GOOD is set
by default on the majority of relevant x86 processors.
Signed-off-by: Borislav Petkov <bp@suse.de>
Move clear_page() up so that we can get 2-byte forward JMPs when
patching:
apply_alternatives: feat: 3*32+16, old: (ffffffff8130adb0, len: 5), repl: (ffffffff81d0b859, len: 5)
ffffffff8130adb0: alt_insn: 90 90 90 90 90
recompute_jump: new_displ: 0x0000003e
ffffffff81d0b859: rpl_insn: eb 3e 66 66 90
even though the compiler generated 5-byte JMPs which we padded with 5
NOPs.
Also, make the REP_GOOD version be the default as the majority of
machines set REP_GOOD. This way we get to save ourselves the JMP:
old insn VA: 0xffffffff813038b0, CPU feat: X86_FEATURE_REP_GOOD, size: 5, padlen: 0
clear_page:
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 0b 00 00 00 jmpq ffffffff813038c0
repl insn: 0xffffffff81cf0e92, size: 0
old insn VA: 0xffffffff813038b0, CPU feat: X86_FEATURE_ERMS, size: 5, padlen: 0
clear_page:
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 0b 00 00 00 jmpq ffffffff813038c0
repl insn: 0xffffffff81cf0e92, size: 5
ffffffff81cf0e92: e9 69 2a 61 ff jmpq ffffffff81303900
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 69 2a 61 ff jmpq ffffffff8091631e
Signed-off-by: Borislav Petkov <bp@suse.de>
... instead of the semi-version with the spelled out sections.
What is more, make the REP_GOOD version be the default copy_page()
version as the majority of the relevant x86 CPUs do set
X86_FEATURE_REP_GOOD. Thus, copy_page gets compiled to:
ffffffff8130af80 <copy_page>:
ffffffff8130af80: e9 0b 00 00 00 jmpq ffffffff8130af90 <copy_page_regs>
ffffffff8130af85: b9 00 02 00 00 mov $0x200,%ecx
ffffffff8130af8a: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
ffffffff8130af8d: c3 retq
ffffffff8130af8e: 66 90 xchg %ax,%ax
ffffffff8130af90 <copy_page_regs>:
...
and after the alternatives have run, the JMP to the old, unrolled
version gets NOPed out:
ffffffff8130af80 <copy_page>:
ffffffff8130af80: 66 66 90 xchg %ax,%ax
ffffffff8130af83: 66 90 xchg %ax,%ax
ffffffff8130af85: b9 00 02 00 00 mov $0x200,%ecx
ffffffff8130af8a: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
ffffffff8130af8d: c3 retq
On modern uarches, those NOPs are cheaper than the unconditional JMP
previously.
Signed-off-by: Borislav Petkov <bp@suse.de>
Up until now we had to pay attention to relative JMPs in alternatives
about how their relative offset gets computed so that the jump target
is still correct. Or, as it is the case for near CALLs (opcode e8), we
still have to go and readjust the offset at patching time.
What is more, the static_cpu_has_safe() facility had to forcefully
generate 5-byte JMPs since we couldn't rely on the compiler to generate
properly sized ones so we had to force the longest ones. Worse than
that, sometimes it would generate a replacement JMP which is longer than
the original one, thus overwriting the beginning of the next instruction
at patching time.
So, in order to alleviate all that and make using JMPs more
straight-forward we go and pad the original instruction in an
alternative block with NOPs at build time, should the replacement(s) be
longer. This way, alternatives users shouldn't pay special attention
so that original and replacement instruction sizes are fine but the
assembler would simply add padding where needed and not do anything
otherwise.
As a second aspect, we go and recompute JMPs at patching time so that we
can try to make 5-byte JMPs into two-byte ones if possible. If not, we
still have to recompute the offsets as the replacement JMP gets put far
away in the .altinstr_replacement section leading to a wrong offset if
copied verbatim.
For example, on a locally generated kernel image
old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
__switch_to:
ffffffff810014bd: eb 21 jmp ffffffff810014e0
repl insn: size: 5
ffffffff81d0b23c: e9 b1 62 2f ff jmpq ffffffff810014f2
gets corrected to a 2-byte JMP:
apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
alt_insn: e9 b1 62 2f ff
recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
converted to: eb 33 90 90 90
and a 5-byte JMP:
old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
__switch_to:
ffffffff81001516: eb 30 jmp ffffffff81001548
repl insn: size: 5
ffffffff81d0b241: e9 10 63 2f ff jmpq ffffffff81001556
gets shortened into a two-byte one:
apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
alt_insn: e9 10 63 2f ff
recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
converted to: eb 3e 90 90 90
... and so on.
This leads to a net win of around
40ish replacements * 3 bytes savings =~ 120 bytes of I$
on an AMD guest which means some savings of precious instruction cache
bandwidth. The padding to the shorter 2-byte JMPs are single-byte NOPs
which on smart microarchitectures means discarding NOPs at decode time
and thus freeing up execution bandwidth.
Signed-off-by: Borislav Petkov <bp@suse.de>
Up until now we have always paid attention to make sure the length of
the new instruction replacing the old one is at least less or equal to
the length of the old instruction. If the new instruction is longer, at
the time it replaces the old instruction it will overwrite the beginning
of the next instruction in the kernel image and cause your pants to
catch fire.
So instead of having to pay attention, teach the alternatives framework
to pad shorter old instructions with NOPs at buildtime - but only in the
case when
len(old instruction(s)) < len(new instruction(s))
and add nothing in the >= case. (In that case we do add_nops() when
patching).
This way the alternatives user shouldn't have to care about instruction
sizes and simply use the macros.
Add asm ALTERNATIVE* flavor macros too, while at it.
Also, we need to save the pad length in a separate struct alt_instr
member for NOP optimization and the way to do that reliably is to carry
the pad length instead of trying to detect whether we're looking at
single-byte NOPs or at pathological instruction offsets like e9 90 90 90
90, for example, which is a valid instruction.
Thanks to Michael Matz for the great help with toolchain questions.
Signed-off-by: Borislav Petkov <bp@suse.de>
x86 instructions cannot exceed 15 bytes, and the instruction
decoder should enforce that. Prior to 6ba48ff46f, the
instruction length limit was implicitly set to 16, which was an
approximation of 15, but there is currently no limit at all.
Fix MAX_INSN_SIZE (it should be 15, not 16), and fix the decoder
to reject instructions that exceed MAX_INSN_SIZE.
Other than potentially confusing some of the decoder sanity
checks, I'm not aware of any actual problems that omitting this
check would cause, nor am I aware of any practical problems
caused by the MAX_INSN_SIZE error.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: 6ba48ff46f ("x86: Remove arbitrary instruction size limit ...
Link: http://lkml.kernel.org/r/f8f0bc9b8c58cfd6830f7d88400bf1396cbdcd0f.1422403511.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recently instrumentation of builtin functions calls was removed from GCC
5.0. To check the memory accessed by such functions, userspace asan
always uses interceptors for them.
So now we should do this as well. This patch declares
memset/memmove/memcpy as weak symbols. In mm/kasan/kasan.c we have our
own implementation of those functions which checks memory before accessing
it.
Default memset/memmove/memcpy now now always have aliases with '__'
prefix. For files that built without kasan instrumentation (e.g.
mm/slub.c) original mem* replaced (via #define) with prefixed variants,
cause we don't want to check memory accesses there.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephane reported that the PEBS fixup was broken by the recent commit to
the instruction decoder. The thing had an off-by-one which resulted in
not being able to decode the last instruction and always bail.
Reported-by: Stephane Eranian <eranian@google.com>
Fixes: 6ba48ff46f ("x86: Remove arbitrary instruction size limit in instruction decoder")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 3.18
Cc: <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20141216104614.GV3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.
This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu
2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.
3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.
4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.
5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.
6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.
7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.
8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.
9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.
10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.
12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.
14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.
16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.
17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.
18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.
19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.
20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.
21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
As there are now no remaining users of arch_fast_hash(), lets kill
it entirely.
This basically reverts commit 71ae8aac3e ("lib: introduce arch
optimized hash library") and follow-up work, that is f.e., commit
237217546d ("lib: hash: follow-up fixups for arch hash"),
commit e3fec2f74f ("lib: Add missing arch generic-y entries for
asm-generic/hash.h") and last but not least commit 6a02652df5
("perf tools: Fix include for non x86 architectures").
Cc: Francesco Fusco <fusco@ntop.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
The current x86 instruction decoder steps along through the
instruction stream but always ensures that it never steps farther
than the largest possible instruction size (MAX_INSN_SIZE).
The MPX code is now going to be doing some decoding of userspace
instructions. We copy those from userspace in to the kernel and
they're obviously completely untrusted coming from userspace. In
addition to the constraint that instructions can only be so long,
we also have to be aware of how long the buffer is that came in
from userspace. This _looks_ to be similar to what the perf and
kprobes is doing, but it's unclear to me whether they are
affected.
The whole reason we need this is that it is perfectly valid to be
executing an instruction within MAX_INSN_SIZE bytes of an
unreadable page. We should be able to gracefully handle short
reads in those cases.
This adds support to the decoder to record how long the buffer
being decoded is and to refuse to "validate" the instruction if
we would have gone over the end of the buffer to decode it.
The kprobes code probably needs to be looked at here a bit more
carefully. This patch still respects the MAX_INSN_SIZE limit
there but the kprobes code does look like it might be able to
be a bit more strict than it currently is.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20141114153957.E6B01535@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Al Viro pointed out that the x86-64 csum_partial_copy_from_user() is
somewhat confused about what it should do on errors, notably it mostly
clears the uncopied end result buffer, but misses that for the initial
alignment case.
All users should check for errors, so it's dubious whether the clearing
is even necessary, and Al also points out that we should probably clean
up the calling conventions, but regardless of any future changes to this
function, the fact that it is inconsistent is just annoying.
So make the __get_user() failure path use the same error exit as all the
other errors do.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 asm updates from Ingo Molnar:
"The changes in this cycle were:
- Speed up the x86 __preempt_schedule() implementation
- Fix/improve low level asm code debug info annotations"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Unwind-annotate thunk_32.S
x86: Improve cmpxchg8b_emu.S
x86: Improve cmpxchg16b_emu.S
x86/lib/Makefile: Remove the unnecessary "+= thunk_64.o"
x86: Speed up ___preempt_schedule*() by using THUNK helpers
- don't include unneeded headers
- drop redundant entry point label
- complete unwind annotations
- use .L prefix on local labels to not clutter the symbol table
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/5422917E0200007800038081@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- don't include unneeded headers
- don't open-code PER_CPU_VAR()
- drop redundant entry point label
- complete unwind annotations
- use .L prefix on local label to not clutter the symbol table
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/542290BC020000780003807D@mail.emea.novell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Trivial. We have "lib-y += thunk_$(BITS).o" at the start, no
need to add thunk_64.o if !CONFIG_X86_32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921184232.GB23727@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
___preempt_schedule() does SAVE_ALL/RESTORE_ALL but this is
suboptimal, we do not need to save/restore the callee-saved
register. And we already have arch/x86/lib/thunk_*.S which
implements the similar asm wrappers, so it makes sense to
redefine ___preempt_schedule() as "THUNK ..." and remove
preempt.S altogether.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921184153.GA23727@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch removes the unused asm/rwlock.h and rwlock.S files.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1408037251-45918-3-git-send-email-Waiman.Long@hp.com
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more perf updates from Ingo Molnar:
"A second round of perf updates:
- wide reaching kprobes sanitization and robustization, with the hope
of fixing all 'probe this function crashes the kernel' bugs, by
Masami Hiramatsu.
- uprobes updates from Oleg Nesterov: tmpfs support, corner case
fixes and robustization work.
- perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
et al:
* Add support to accumulate hist periods (Namhyung Kim)
* various fixes, refactorings and enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
perf: Differentiate exec() and non-exec() comm events
perf: Fix perf_event_comm() vs. exec() assumption
uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
perf/documentation: Add description for conditional branch filter
perf/x86: Add conditional branch filtering support
perf/tool: Add conditional branch filter 'cond' to perf record
perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
uprobes: Teach copy_insn() to support tmpfs
uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
perf/x86: Use common PMU interrupt disabled code
perf/ARM: Use common PMU interrupt disabled code
perf: Disable sampled events if no PMU interrupt
perf: Fix use after free in perf_remove_from_context()
perf tools: Fix 'make help' message error
perf record: Fix poll return value propagation
perf tools: Move elide bool into perf_hpp_fmt struct
perf tools: Remove elide setup for SORT_MODE__MEMORY mode
perf tools: Fix "==" into "=" in ui_browser__warning assignment
perf tools: Allow overriding sysfs and proc finding with env var
perf tools: Consider header files outside perf directory in tags target
...
Carve out early cmdline parsing function into .../lib/cmdline.c so it
can be used by early code in the kernel proper as well.
Adapted from arch/x86/boot/cmdline.c.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Due to a typo the msr accessor function introduced in
22085a66c2 didn't have any lasting
effects because they accidentally wrote the old value back.
After c0a639ad0b this at the very least
this causes cpuid limits not to be lifted on some cpus leading to
missing capabilities for those.
Signed-off-by: Andres Freund <andres@anarazel.de>
Link: http://lkml.kernel.org/r/1399598957-7011-2-git-send-email-andres@anarazel.de
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
thunk/restore functions are also used for tracing irqoff etc.
and those are involved in kprobe's exception handling.
Prohibit probing on them to avoid kernel crash.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081726.26341.3872.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 LTO changes from Peter Anvin:
"More infrastructure work in preparation for link-time optimization
(LTO). Most of these changes is to make sure symbols accessed from
assembly code are properly marked as visible so the linker doesn't
remove them.
My understanding is that the changes to support LTO are still not
upstream in binutils, but are on the way there. This patchset should
conclude the x86-specific changes, and remaining patches to actually
enable LTO will be fed through the Kbuild tree (other than keeping up
with changes to the x86 code base, of course), although not
necessarily in this merge window"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
Kbuild, lto: Handle basic LTO in modpost
Kbuild, lto: Disable LTO for asm-offsets.c
Kbuild, lto: Add a gcc-ld script to let run gcc as ld
Kbuild, lto: add ld-version and ld-ifversion macros
Kbuild, lto: Drop .number postfixes in modpost
Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
lto: Disable LTO for sys_ni
lto: Handle LTO common symbols in module loader
lto, workaround: Add workaround for initcall reordering
lto: Make asmlinkage __visible
x86, lto: Disable LTO for the x86 VDSO
initconst, x86: Fix initconst mistake in ts5500 code
initconst: Fix initconst mistake in dcdbas
asmlinkage: Make trace_hardirqs_on/off_caller visible
asmlinkage, x86: Fix 32bit memcpy for LTO
asmlinkage Make __stack_chk_failed and memcmp visible
asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
asmlinkage: Make main_extable_sort_needed visible
asmlinkage, mutex: Mark __visible
asmlinkage: Make trace_hardirq visible
...
Pull x86 hashing changes from Ingo Molnar:
"Small fixes and cleanups to the librarized arch_fast_hash() methods,
used by the net/openvswitch code"
* 'x86-hash-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, hash: Simplify switch, add __init annotation
x86, hash: Swap arguments passed to crc32_u32()
x86, hash: Fix build failure with older binutils
Minor cleanups:
- simplify switch statement
- add __init annotation to setup_arch_fast_hash()
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/530F09CE020000780011FBEF@nat28.tlf.novell.com
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Thomas Graf <tgraf@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
... to match the function's parameters. While reportedly commutative,
using the proper order allows for leveraging the instruction permitting
the source operand to be in memory.
[ hpa: This code originated in the dpdk toolkit. This was a bug in dpdk
which has recently been fixed in part due to an earlier version of
this patch. ]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/530F09B6020000780011FBEB@nat28.tlf.novell.com
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Thomas Graf <tgraf@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Just like for other ISA extension instruction uses we should check
whether the assembler actually supports them. The fallback here simply
is to encode an instruction with fixed operands (%eax and %ecx).
[ hpa: tagging for -stable as a build fix ]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/530F0996020000780011FBE7@nat28.tlf.novell.com
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Thomas Graf <tgraf@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.14
We very often need to set or clear a bit in an MSR as a result of doing
some sort of a hardware configuration. Add generic versions of that
repeated functionality in order to save us a bunch of duplicated code in
the early CPU vendor detection/config code.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1394384725-10796-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
Pull x86 cpufeature and mpx updates from Peter Anvin:
"This includes the basic infrastructure for MPX (Memory Protection
Extensions) support, but does not include MPX support itself. It is,
however, a prerequisite for KVM support for MPX, which I believe will
be pushed later this merge window by the KVM team.
This includes moving the functionality in
futex_atomic_cmpxchg_inatomic() into a new function in uaccess.h so it
can be reused - this will be used by the final MPX patches.
The actual MPX functionality (map management and so on) will be pushed
in a future merge window, when ready"
* 'x86/mpx' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/mpx: Remove unused LWP structure
x86, mpx: Add MPX related opcodes to the x86 opcode map
x86: replace futex_atomic_cmpxchg_inatomic() with user_atomic_cmpxchg_inatomic
x86: add user_atomic_cmpxchg_inatomic at uaccess.h
x86, xsave: Support eager-only xsave features, add MPX support
x86, cpufeature: Define the Intel MPX feature flag
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu, amd: Fix a shadowed variable situation
um, x86: Fix vDSO build
x86: Delete non-required instances of include <linux/init.h>
x86, realmode: Pointer walk cleanups, pull out invariant use of __pa()
x86/traps: Clean up error exception handler definitions
This patch adds all the MPX instructions to x86 opcode map, so the x86
instruction decoder can decode MPX instructions.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Link: http://lkml.kernel.org/r/1389518403-7715-4-git-send-email-qiaowei.ren@intel.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
[ hpa: undid incorrect removal from arch/x86/kernel/head_32.S ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1389054026-12947-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
We introduce a new hashing library that is meant to be used in
the contexts where speed is more important than uniformity of the
hashed values. The hash library leverages architecture specific
implementation to achieve high performance and fall backs to
jhash() for the generic case.
On Intel-based x86 architectures, the library can exploit the crc32l
instruction, part of the Intel SSE4.2 instruction set, if the
instruction is supported by the processor. This implementation
is twice as fast as the jhash() implementation on an i7 processor.
Additional architectures, such as Arm64 provide instructions for
accelerating the computation of CRC, so they could be added as well
in follow-up work.
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are using lea to produce a 32-bit result, we can use the leal
form, rather than using leaq and worry about truncation elsewhere.
Make the leal explicit, both to be more obvious and since that is what
gcc generates and thus is less likely to trigger obscure gas bugs.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1384634221-6006-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Operation of rep movsb instruction handles zero byte copy. As pointed out by
Linus, there is no need to check zero size in kernel. Removing this redundant
check saves a few cycles in copy user functions.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1384634221-6006-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
- New power capping framework and the the Intel Running Average Power
Limit (RAPL) driver using it from Srinivas Pandruvada and Jacob Pan.
- Addition of the in-kernel switching feature to the arm_big_little
cpufreq driver from Viresh Kumar and Nicolas Pitre.
- cpufreq support for iMac G5 from Aaro Koskinen.
- Baytrail processors support for intel_pstate from Dirk Brandewie.
- cpufreq support for Midway/ECX-2000 from Mark Langsdorf.
- ARM vexpress/TC2 cpufreq support from Sudeep KarkadaNagesha.
- ACPI power management support for the I2C and SPI bus types from
Mika Westerberg and Lv Zheng.
- cpufreq core fixes and cleanups from Viresh Kumar, Srivatsa S Bhat,
Stratos Karafotis, Xiaoguang Chen, Lan Tianyu.
- cpufreq drivers updates (mostly fixes and cleanups) from Viresh Kumar,
Aaro Koskinen, Jungseok Lee, Sudeep KarkadaNagesha, Lukasz Majewski,
Manish Badarkhe, Hans-Christian Egtvedt, Evgeny Kapaev.
- intel_pstate updates from Dirk Brandewie and Adrian Huang.
- ACPICA update to version 20130927 includig fixes and cleanups and
some reduction of divergences between the ACPICA code in the kernel
and ACPICA upstream in order to improve the automatic ACPICA patch
generation process. From Bob Moore, Lv Zheng, Tomasz Nowicki,
Naresh Bhat, Bjorn Helgaas, David E Box.
- ACPI IPMI driver fixes and cleanups from Lv Zheng.
- ACPI hotplug fixes and cleanups from Bjorn Helgaas, Toshi Kani,
Zhang Yanfei, Rafael J Wysocki.
- Conversion of the ACPI AC driver to the platform bus type and
multiple driver fixes and cleanups related to ACPI from Zhang Rui.
- ACPI processor driver fixes and cleanups from Hanjun Guo, Jiang Liu,
Bartlomiej Zolnierkiewicz, Mathieu Rhéaume, Rafael J Wysocki.
- Fixes and cleanups and new blacklist entries related to the ACPI
video support from Aaron Lu, Felipe Contreras, Lennart Poettering,
Kirill Tkhai.
- cpuidle core cleanups from Viresh Kumar and Lorenzo Pieralisi.
- cpuidle drivers fixes and cleanups from Daniel Lezcano, Jingoo Han,
Bartlomiej Zolnierkiewicz, Prarit Bhargava.
- devfreq updates from Sachin Kamat, Dan Carpenter, Manish Badarkhe.
- Operation Performance Points (OPP) core updates from Nishanth Menon.
- Runtime power management core fix from Rafael J Wysocki and update
from Ulf Hansson.
- Hibernation fixes from Aaron Lu and Rafael J Wysocki.
- Device suspend/resume lockup detection mechanism from Benoit Goby.
- Removal of unused proc directories created for various ACPI drivers
from Lan Tianyu.
- ACPI LPSS driver fix and new device IDs for the ACPI platform scan
handler from Heikki Krogerus and Jarkko Nikula.
- New ACPI _OSI blacklist entry for Toshiba NB100 from Levente Kurusa.
- Assorted fixes and cleanups related to ACPI from Andy Shevchenko,
Al Stone, Bartlomiej Zolnierkiewicz, Colin Ian King, Dan Carpenter,
Felipe Contreras, Jianguo Wu, Lan Tianyu, Yinghai Lu, Mathias Krause,
Liu Chuansheng.
- Assorted PM fixes and cleanups from Andy Shevchenko, Thierry Reding,
Jean-Christophe Plagniol-Villard.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABCAAGBQJSfPKLAAoJEILEb/54YlRxH6YQAJwDKi25RCZziFSIenXuqzC/
c6JxoH/tSnDHJHhcTgqh7H7Raa+zmatMDf0m2oEv2Wjfx4Lt4BQK4iefhe/zY4lX
yJ8uXDg+U8DYhDX2XwbwnFpd1M1k/A+s2gIHDTHHGnE0kDngXdd8RAFFktBmooTZ
l5LBQvOrTlgX/ZfqI/MNmQ6lfY6kbCABGSHV1tUUsDA6Kkvk/LAUTOMSmptv1q22
hcs6k55vR34qADPkUX5GghjmcYJv+gNtvbDEJUjcmCwVoPWouF415m7R5lJ8w3/M
49Q8Tbu5HELWLwca64OorS8qh/P7sgUOf1BX5IDzHnJT+TGeDfvcYbMv2Z275/WZ
/bqhuLuKBpsHQ2wvEeT+lYV3FlifKeTf1FBxER3ApjzI3GfpmVVQ+dpEu8e9hcTh
ZTPGzziGtoIsHQ0unxb+zQOyt1PmIk+cU4IsKazs5U20zsVDMcKzPrb19Od49vMX
gCHvRzNyOTqKWpE83Ss4NGOVPAG02AXiXi/BpuYBHKDy6fTH/liKiCw5xlCDEtmt
lQrEbupKpc/dhCLo5ws6w7MZzjWJs2eSEQcNR4DlR++pxIpYOOeoPTXXrghgZt2X
mmxZI2qsJ7GAvPzII8OBeF3CRO3fabZ6Nez+M+oEZjGe05ZtpB3ccw410HwieqBn
dYpJFt/BHK189odhV9CM
=JCxk
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael J Wysocki:
- New power capping framework and the the Intel Running Average Power
Limit (RAPL) driver using it from Srinivas Pandruvada and Jacob Pan.
- Addition of the in-kernel switching feature to the arm_big_little
cpufreq driver from Viresh Kumar and Nicolas Pitre.
- cpufreq support for iMac G5 from Aaro Koskinen.
- Baytrail processors support for intel_pstate from Dirk Brandewie.
- cpufreq support for Midway/ECX-2000 from Mark Langsdorf.
- ARM vexpress/TC2 cpufreq support from Sudeep KarkadaNagesha.
- ACPI power management support for the I2C and SPI bus types from Mika
Westerberg and Lv Zheng.
- cpufreq core fixes and cleanups from Viresh Kumar, Srivatsa S Bhat,
Stratos Karafotis, Xiaoguang Chen, Lan Tianyu.
- cpufreq drivers updates (mostly fixes and cleanups) from Viresh
Kumar, Aaro Koskinen, Jungseok Lee, Sudeep KarkadaNagesha, Lukasz
Majewski, Manish Badarkhe, Hans-Christian Egtvedt, Evgeny Kapaev.
- intel_pstate updates from Dirk Brandewie and Adrian Huang.
- ACPICA update to version 20130927 includig fixes and cleanups and
some reduction of divergences between the ACPICA code in the kernel
and ACPICA upstream in order to improve the automatic ACPICA patch
generation process. From Bob Moore, Lv Zheng, Tomasz Nowicki, Naresh
Bhat, Bjorn Helgaas, David E Box.
- ACPI IPMI driver fixes and cleanups from Lv Zheng.
- ACPI hotplug fixes and cleanups from Bjorn Helgaas, Toshi Kani, Zhang
Yanfei, Rafael J Wysocki.
- Conversion of the ACPI AC driver to the platform bus type and
multiple driver fixes and cleanups related to ACPI from Zhang Rui.
- ACPI processor driver fixes and cleanups from Hanjun Guo, Jiang Liu,
Bartlomiej Zolnierkiewicz, Mathieu Rhéaume, Rafael J Wysocki.
- Fixes and cleanups and new blacklist entries related to the ACPI
video support from Aaron Lu, Felipe Contreras, Lennart Poettering,
Kirill Tkhai.
- cpuidle core cleanups from Viresh Kumar and Lorenzo Pieralisi.
- cpuidle drivers fixes and cleanups from Daniel Lezcano, Jingoo Han,
Bartlomiej Zolnierkiewicz, Prarit Bhargava.
- devfreq updates from Sachin Kamat, Dan Carpenter, Manish Badarkhe.
- Operation Performance Points (OPP) core updates from Nishanth Menon.
- Runtime power management core fix from Rafael J Wysocki and update
from Ulf Hansson.
- Hibernation fixes from Aaron Lu and Rafael J Wysocki.
- Device suspend/resume lockup detection mechanism from Benoit Goby.
- Removal of unused proc directories created for various ACPI drivers
from Lan Tianyu.
- ACPI LPSS driver fix and new device IDs for the ACPI platform scan
handler from Heikki Krogerus and Jarkko Nikula.
- New ACPI _OSI blacklist entry for Toshiba NB100 from Levente Kurusa.
- Assorted fixes and cleanups related to ACPI from Andy Shevchenko, Al
Stone, Bartlomiej Zolnierkiewicz, Colin Ian King, Dan Carpenter,
Felipe Contreras, Jianguo Wu, Lan Tianyu, Yinghai Lu, Mathias Krause,
Liu Chuansheng.
- Assorted PM fixes and cleanups from Andy Shevchenko, Thierry Reding,
Jean-Christophe Plagniol-Villard.
* tag 'pm+acpi-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (386 commits)
cpufreq: conservative: fix requested_freq reduction issue
ACPI / hotplug: Consolidate deferred execution of ACPI hotplug routines
PM / runtime: Use pm_runtime_put_sync() in __device_release_driver()
ACPI / event: remove unneeded NULL pointer check
Revert "ACPI / video: Ignore BIOS initial backlight value for HP 250 G1"
ACPI / video: Quirk initial backlight level 0
ACPI / video: Fix initial level validity test
intel_pstate: skip the driver if ACPI has power mgmt option
PM / hibernate: Avoid overflow in hibernate_preallocate_memory()
ACPI / hotplug: Do not execute "insert in progress" _OST
ACPI / hotplug: Carry out PCI root eject directly
ACPI / hotplug: Merge device hot-removal routines
ACPI / hotplug: Make acpi_bus_hot_remove_device() internal
ACPI / hotplug: Simplify device ejection routines
ACPI / hotplug: Fix handle_root_bridge_removal()
ACPI / hotplug: Refuse to hot-remove all objects with disabled hotplug
ACPI / scan: Start matching drivers after trying scan handlers
ACPI: Remove acpi_pci_slot_init() headers from internal.h
ACPI / blacklist: fix name of ThinkPad Edge E530
PowerCap: Fix build error with option -Werror=format-security
...
Conflicts:
arch/arm/mach-omap2/opp.c
drivers/Kconfig
drivers/spi/spi.c
Pull x86 user access changes from Ingo Molnar:
"This tree contains two copy_[from/to]_user() build time checking
changes/enhancements from Jan Beulich.
The desired outcome is to get better compiler warnings with
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=y, to keep people from
introducing bugs such as overflows and information leaks"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Unify copy_to_user() and add size checking to it
x86: Unify copy_from_user() size checking
The arch_perf_output_copy_user() default of
__copy_from_user_inatomic() returns bytes not copied, while all other
argument functions given DEFINE_OUTPUT_COPY() return bytes copied.
Since copy_from_user_nmi() is the odd duck out by returning bytes
copied where all other *copy_{to,from}* functions return bytes not
copied, change it over and ammend DEFINE_OUTPUT_COPY() to expect bytes
not copied.
Oddly enough DEFINE_OUTPUT_COPY() already returned bytes not copied
while expecting its worker functions to return bytes copied.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: will.deacon@arm.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20131030201622.GR16117@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we can deal with nested NMI due to IRET re-enabling NMIs and
can deal with faults from NMI by making sure we preserve CR2 over NMIs
we can in fact simply access user-space memory from NMI context.
So rewrite copy_from_user_nmi() to use __copy_from_user_inatomic() and
rework the fault path to do the minimal required work before taking
the in_atomic() fault handler.
In particular avoid perf_sw_event() which would make perf recurse on
itself (it should be harmless as our recursion protections should be
able to deal with this -- but why tempt fate).
Also rename notify_page_fault() to kprobes_fault() as that is a much
better name; there is no notifier in it and its specific to kprobes.
Don measured that his worst case NMI path shrunk from ~300K cycles to
~150K cycles.
Cc: Stephane Eranian <eranian@google.com>
Cc: jmario@redhat.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: dave.hansen@linux.intel.com
Tested-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131024105206.GM2490@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Similarly to copy_from_user(), where the range check is to
protect against kernel memory corruption, copy_to_user() can
benefit from such checking too: Here it protects against kernel
information leaks.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5265059502000078000FC4F6@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Commits 4a31276930 ("x86: Turn the
copy_from_user check into an (optional) compile time warning")
and 63312b6a6f ("x86: Add a
Kconfig option to turn the copy_from_user warnings into errors")
touched only the 32-bit variant of copy_from_user(), whereas the
original commit 9f0cf4adb6 ("x86:
Use __builtin_object_size() to validate the buffer size for
copy_from_user()") also added the same code to the 64-bit one.
Further the earlier conversion from an inline WARN() to the call
to copy_from_user_overflow() went a little too far: When the
number of bytes to be copied is not a constant (e.g. [looking at
3.11] in drivers/net/tun.c:__tun_chr_ioctl() or
drivers/pci/pcie/aer/aer_inject.c:aer_inject_write()), the
compiler will always have to keep the funtion call, and hence
there will always be a warning. By using __builtin_constant_p()
we can avoid this.
And then this slightly extends the effect of
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS in that apart from
converting warnings to errors in the constant size case, it
retains the (possibly wrong) warnings in the non-constant size
case, such that if someone is prepared to get a few false
positives, (s)he'll be able to recover the current behavior
(except that these diagnostics now will never be converted to
errors).
Since the 32-bit variant (intentionally) didn't call
might_fault(), the unification results in this being called
twice now. Adding a suitable #ifdef would be the alternative if
that's a problem.
I'd like to point out though that with
__compiletime_object_size() being restricted to gcc before 4.6,
the whole construct is going to become more and more pointless
going forward. I would question however that commit
2fb0815c9e ("gcc4: disable
__compiletime_object_size for GCC 4.6+") was really necessary,
and instead this should have been dealt with as is done here
from the beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5265056D02000078000FC4F3@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Having 64-bit MSR access methods on given CPU can avoid shifting and
simplify MSR content manipulation. We already have other combinations
of rdmsrl_xxx and wrmsrl_xxx but missing the _on_cpu version.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As the new x86 CPU bootup printout format code maintainer, I am
taking immediate action to improve and clean (and thus indulge
my OCD) the reporting of the cores when coming up online.
Fix padding to a right-hand alignment, cleanup code and bind
reporting width to the max number of supported CPUs on the
system, like this:
[ 0.074509] smpboot: Booting Node 0, Processors: #1#2#3#4#5#6#7 OK
[ 0.644008] smpboot: Booting Node 1, Processors: #8#9#10#11#12#13#14#15 OK
[ 1.245006] smpboot: Booting Node 2, Processors: #16#17#18#19#20#21#22#23 OK
[ 1.864005] smpboot: Booting Node 3, Processors: #24#25#26#27#28#29#30#31 OK
[ 2.489005] smpboot: Booting Node 4, Processors: #32#33#34#35#36#37#38#39 OK
[ 3.093005] smpboot: Booting Node 5, Processors: #40#41#42#43#44#45#46#47 OK
[ 3.698005] smpboot: Booting Node 6, Processors: #48#49#50#51#52#53#54#55 OK
[ 4.304005] smpboot: Booting Node 7, Processors: #56#57#58#59#60#61#62#63 OK
[ 4.961413] Brought up 64 CPUs
and this:
[ 0.072367] smpboot: Booting Node 0, Processors: #1#2#3#4#5#6#7 OK
[ 0.686329] Brought up 8 CPUs
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Libin <huawei.libin@huawei.com>
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Link: http://lkml.kernel.org/r/20130927143554.GF4422@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 SMAP fixes from Ingo Molnar:
"Fixes for Intel SMAP support, to fix SIGSEGVs during bootup"
* 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Introduce [compat_]save_altstack_ex() to unbreak x86 SMAP
x86, smap: Handle csum_partial_copy_*_user()
Pull x86/asmlinkage changes from Ingo Molnar:
"As a preparation for Andi Kleen's LTO patchset (link time
optimizations using GCC's -flto which build time optimization has
steadily increased in quality over the past few years and might
eventually be usable for the kernel too) this tree includes a handful
of preparatory patches that make function calling convention
annotations consistent again:
- Mark every function without arguments (or 64bit only) that is used
by assembly code with asmlinkage()
- Mark every function with parameters or variables that is used by
assembly code as __visible.
For the vanilla kernel this has documentation, consistency and
debuggability advantages, for the time being"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asmlinkage: Fix warning in xen asmlinkage change
x86, asmlinkage, vdso: Mark vdso variables __visible
x86, asmlinkage, power: Make various symbols used by the suspend asm code visible
x86, asmlinkage: Make dump_stack visible
x86, asmlinkage: Make 64bit checksum functions visible
x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
x86, asmlinkage, apm: Make APM data structure used from assembler visible
x86, asmlinkage: Make syscall tables visible
x86, asmlinkage: Make several variables used from assembler/linker script visible
x86, asmlinkage: Make kprobes code visible and fix assembler code
x86, asmlinkage: Make various syscalls asmlinkage
x86, asmlinkage: Make 32bit/64bit __switch_to visible
x86, asmlinkage: Make _*_start_kernel visible
x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible
x86, asmlinkage: Change dotraplinkage into __visible on 32bit
x86: Fix sys_call_table type in asm/syscall.h
Add SMAP annotations to csum_partial_copy_to/from_user(). These
functions legitimately access user space and thus need to set the AC
flag.
TODO: add explicit checks that the side with the kernel space pointer
really points into kernel space.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-2aps0u00eer658fd5xyanan7@git.kernel.org
Cc: <stable@vger.kernel.org> # v3.7+
Add TSX-NI related instructions and new instructions to
x86-opcode-map.txt according to the Intel(R) 64 and IA-32
Architectures Software Developer's Manual Vol2C (June, 2013).
This also includes below updates.
- Fix a typo of MWAIT (the lack of (11B)).
- Change NOP Ev to prefetchw Ev
- Add CRC32 new prefix style (66&F2)
- Add ADCX, ADOX, RDSEED, CLAC and STAC instructions
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20130806073750.4049.12365.stgit@udc4-manage.rcp.hitachi.co.jp
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The help text for this config is duplicated across the x86, parisc, and
s390 Kconfig.debug files. Arnd Bergman noted that the help text was
slightly misleading and should be fixed to state that enabling this
option isn't a problem when using pre 4.4 gcc.
To simplify the rewording, consolidate the text into lib/Kconfig.debug
and modify it there to be more explicit about when you should say N to
this config.
Also, make the text a bit more generic by stating that this option
enables compile time checks so we can cover architectures which emit
warnings vs. ones which emit errors. The details of how an
architecture decided to implement the checks isn't as important as the
concept of compile time checking of copy_from_user() calls.
While we're doing this, remove all the copy_from_user_overflow() code
that's duplicated many times and place it into lib/ so that any
architecture supporting this option can get the function for free.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Helge Deller <deller@gmx.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 cleanups from Ingo Molnar:
"Misc smaller cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/lib: Fix spelling, put space between a numeral and its units
x86/lib: Fix spelling in the comments
x86, quirks: Shut-up a long-standing gcc warning
x86, msr: Unify variable names
x86-64, docs, mm: Add vsyscall range to virtual address space layout
x86: Drop KERNEL_IMAGE_START
x86_64: Use __BOOT_DS instead_of __KERNEL_DS for safety
As suggested by Peter Anvin.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: H . Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Apparently 'byts' should be 'bytes'.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: H . Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The increment of "to" in copy_user_handle_tail() will have incremented
before a failure has been noted. This causes us to skip a byte in the
failure case.
Only do the increment when assured there is no failure.
Signed-off-by: CQ Tang <cq.tang@intel.com>
Link: http://lkml.kernel.org/r/20130318150221.8439.993.stgit@phlsvslse11.ph.intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Consistently use the data register by name and use a sized assembly
instruction in getuser.S. There is never any reason to macroize it,
and being inconsistent in the same file is just annoying.
No actual code change.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Implement __get_user_8() for x86-32. It will return the
64-bit result in edx:eax register pair, and ecx is used
to pass in the address and return the error value.
For consistency, change the register assignment for all
other __get_user_x() variants, so that address is passed in
ecx/rcx, the error value is returned in ecx/rcx, and eax/rax
contains the actual value.
[ hpa: I modified the patch so that it does NOT change the calling
conventions for the existing callsites, this also means that the code
is completely unchanged for 64 bits.
Instead, continue to use eax for address input/error output and use
the ecx:edx register pair for the output. ]
This is a partial refresh of a patch [1] by Jamie Lokier from
2004. Only the minimal changes to implement 64bit get_user()
were picked from the original patch.
[1] http://article.gmane.org/gmane.linux.kernel/198823
Originally-by: Jamie Lokier <jamie@shareable.org>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link:
http://lkml.kernel.org/r/1355312043-11467-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitconst,
and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Drake <dsd@laptop.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull "Nuke 386-DX/SX support" from Ingo Molnar:
"This tree removes ancient-386-CPUs support and thus zaps quite a bit
of complexity:
24 files changed, 56 insertions(+), 425 deletions(-)
... which complexity has plagued us with extra work whenever we wanted
to change SMP primitives, for years.
Unfortunately there's a nostalgic cost: your old original 386 DX33
system from early 1991 won't be able to boot modern Linux kernels
anymore. Sniff."
I'm not sentimental. Good riddance.
* 'x86-nuke386-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, 386 removal: Document Nx586 as a 386 and thus unsupported
x86, cleanups: Simplify sync_core() in the case of no CPUID
x86, 386 removal: Remove CONFIG_X86_POPAD_OK
x86, 386 removal: Remove CONFIG_X86_WP_WORKS_OK
x86, 386 removal: Remove CONFIG_INVLPG
x86, 386 removal: Remove CONFIG_BSWAP
x86, 386 removal: Remove CONFIG_XADD
x86, 386 removal: Remove CONFIG_CMPXCHG
x86, 386 removal: Remove CONFIG_M386 from Kconfig
Modern CPUs use fast-string instruction to accelerate copy
performance, by combining data into 128 bit chunks.
Modify comments and coding style to match it.
Signed-off-by: Ma Ling <ling.ma@intel.com>
Cc: iant@google.com
Link: http://lkml.kernel.org/r/1350503565-19167-1-git-send-email-ling.ma@intel.com
[ Cleaned up the clean up. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a build failure in the x86 insn_sanity program after the UAPI split. The
problem is that insn_sanity.c #includes arch/x86/lib/insn.c - which uses the
kernel string header. This leads to conflicts for various definitions against
the /usr/include/ headers.
linux/string.h can be replaced with the normal userspace string.h if __KERNEL__
is not specified.
HOSTCC arch/x86/tools/insn_sanity
In file included from /data/fs/linux-2.6-hdr/include/linux/string.h:6:0,
from /data/fs/linux-2.6-hdr/arch/x86/lib/insn.c:21,
from arch/x86/tools/insn_sanity.c:36:
/data/fs/linux-2.6-hdr/include/linux/types.h:14:26: error: conflicting types for 'fd_set'
/usr/include/sys/select.h:76:5: note: previous declaration of 'fd_set' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:15:25: error: conflicting types for 'dev_t'
/usr/include/sys/types.h:61:17: note: previous declaration of 'dev_t' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:25:26: error: conflicting types for 'timer_t'
/usr/include/time.h:104:19: note: previous declaration of 'timer_t' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:45:26: error: conflicting types for 'loff_t'
/usr/include/sys/types.h:45:18: note: previous declaration of 'loff_t' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:112:17: error: conflicting types for 'u_int64_t'
/usr/include/sys/types.h:204:1: note: previous declaration of 'u_int64_t' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:113:17: error: conflicting types for 'int64_t'
/usr/include/sys/types.h:198:1: note: previous declaration of 'int64_t' was here
/data/fs/linux-2.6-hdr/include/linux/types.h:134:23: error: conflicting types for 'blkcnt_t'
/usr/include/sys/types.h:236:20: note: previous declaration of 'blkcnt_t' was here
In file included from /data/fs/linux-2.6-hdr/arch/x86/lib/insn.c:21:0,
from arch/x86/tools/insn_sanity.c:36:
/data/fs/linux-2.6-hdr/include/linux/string.h:38:12: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:38:12: error: expected identifier or '(' before ')' token
/data/fs/linux-2.6-hdr/include/linux/string.h:41:12: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:53:15: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:61:28: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'skip_spaces'
/data/fs/linux-2.6-hdr/include/linux/string.h:65:28: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'char'
/data/fs/linux-2.6-hdr/include/linux/string.h:83:15: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:83:15: error: expected identifier or '(' before ')' token
/data/fs/linux-2.6-hdr/include/linux/string.h:86:15: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:86:15: error: expected identifier or '(' before ')' token
/data/fs/linux-2.6-hdr/include/linux/string.h:89:24: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:89:24: error: expected identifier or '(' before ')' token
/data/fs/linux-2.6-hdr/include/linux/string.h:92:24: error: expected identifier or '(' before '__extension__'
/data/fs/linux-2.6-hdr/include/linux/string.h:92:24: error: expected identifier or '(' before ')' token
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
When Supervisor Mode Access Prevention (SMAP) is enabled, access to
userspace from the kernel is controlled by the AC flag. To make the
performance of manipulating that flag acceptable, there are two new
instructions, STAC and CLAC, to set and clear it.
This patch adds those instructions, via alternative(), when the SMAP
feature is enabled. It also adds X86_EFLAGS_AC unconditionally to the
SYSCALL entry mask; there is simply no reason to make that one
conditional.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpufeature: Remove stray %s, add -w to mkcapflags.pl
x86, cpufeature: Catch duplicate CPU feature strings
x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERM
x86: Fix kernel-doc warnings
x86, compat: Use test_thread_flag(TIF_IA32) in compat signal delivery
I noticed that the LBR fixups were not working anymore
on programs where they used to. I tracked this down to
a recent change to copy_from_user_nmi():
db0dc75d64 ("perf/x86: Check user address explicitly in copy_from_user_nmi()")
This commit added a call to __range_not_ok() to the
copy_from_user_nmi() routine. The problem is that the logic
of the test must be reversed. __range_not_ok() returns 0 if the
range is VALID. We want to return early from copy_from_user_nmi()
if the range is NOT valid.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/r/20120611134426.GA7542@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There were paravirt_ops hooks for the full register set variant of
{rd,wr}msr_safe which are actually not used by anyone anymore. Remove
them to make the code cleaner and avoid silent breakages when the pvops
members were uninitialized. This has been boot-tested natively and under
Xen with PVOPS enabled and disabled on one machine.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Link: http://lkml.kernel.org/r/1338562358-28182-2-git-send-email-bp@amd64.org
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fix the x86 instruction decoder to decode bsr/bsf/jmpe with
operand-size prefix (66h). This fixes the test case failure
reported by Linus, attached below.
bsf/bsr/jmpe have a special encoding. Opcode map in
Intel Software Developers Manual vol2 says they have
TZCNT/LZCNT variants if it has F3h prefix. However, there
is no information if it has other 66h or F2h prefixes.
Current instruction decoder supposes that those are
bad instructions, but it actually accepts at least
operand-size prefixes.
H. Peter Anvin further explains:
" TZCNT/LZCNT are F3 + BSF/BSR exactly because the F2 and
F3 prefixes have historically been no-ops with most instructions.
This allows software to unconditionally use the prefixed versions
and get TZCNT/LZCNT on the processors that have them if they don't
care about the difference. "
This fixes errors reported by test_get_len:
Warning: arch/x86/tools/test_get_len found difference at <em_bsf>:ffffffff81036d87
Warning: ffffffff81036de5: 66 0f bc c2 bsf %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: arch/x86/tools/test_get_len found difference at <em_bsr>:ffffffff81036ea6
Warning: ffffffff81036f04: 66 0f bd c2 bsr %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: decoded and checked 13298882 instructions with 2 warnings
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <yrl.pp-manager.tt@hitachi.com>
Link: http://lkml.kernel.org/r/20120604150911.22338.43296.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This throws away the old x86-specific functions in favor of the generic
optimized version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic strncpy_from_user() is not really optimal, since it is
designed to work on both little-endian and big-endian. And on
little-endian you can simplify much of the logic to find the first zero
byte, since little-endian arithmetic doesn't have to worry about the
carry bit propagating into earlier bytes (only later bytes, which we
don't care about).
But I have patches to make the generic routines use the architecture-
specific <asm/word-at-a-time.h> infrastructure, so that we can regain
the little-endian optimizations. But before we do that, switch over to
the generic routines to make the patches each do just one well-defined
thing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull exception table generation updates from Ingo Molnar:
"The biggest change here is to allow the build-time sorting of the
exception table, to speed up booting. This is achieved by the
architecture enabling BUILDTIME_EXTABLE_SORT. This option is enabled
for x86 and MIPS currently.
On x86 a number of fixes and changes were needed to allow build-time
sorting of the exception table, in particular a relocation invariant
exception table format was needed. This required the abstracting out
of exception table protocol and the removal of 20 years of accumulated
assumptions about the x86 exception table format.
While at it, this tree also cleans up various other aspects of
exception handling, such as early(er) exception handling for
rdmsr_safe() et al.
All in one, as the result of these changes the x86 exception code is
now pretty nice and modern. As an added bonus any regressions in this
code will be early and violent crashes, so if you see any of those,
you'll know whom to blame!"
Fix up trivial conflicts in arch/{mips,x86}/Kconfig files due to nearby
modifications of other core architecture options.
* 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
Revert "x86, extable: Disable presorted exception table for now"
scripts/sortextable: Handle relative entries, and other cleanups
x86, extable: Switch to relative exception table entries
x86, extable: Disable presorted exception table for now
x86, extable: Add _ASM_EXTABLE_EX() macro
x86, extable: Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/xsave.h
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
x86, extable: Remove the now-unused __ASM_EX_SEC macros
x86, extable: Remove open-coded exception table entries in arch/x86/xen/xen-asm_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/um/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c
x86, extable: Remove open-coded exception table entries in arch/x86/lib/putuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/getuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/entry_64.S
...
This makes the newly optimized x86 strncpy_from_user clear the final
bytes in the word past the final NUL character, rather than copy them as
the word they were in the source.
NOTE! Unlike the silly semantics of the libc 'strncpy()' function, the
kernel strncpy_from_user() has never cleared all of the end of the
destination buffer. And neither does it do so now: it only clears the
bytes at the end of the last word it copied.
So why make this change at all? It doesn't really cost us anything extra
(we have to calculate the mask to get the length anyway), and it means
that *if* any user actually cares about zeroing the whole buffer, they
can do a "memset()" before the strncpy_from_user(), and we will no
longer write random bytes after the NUL character.
In particular, the buffer contents will now at no point contain random
source data from beyond the end of the string.
In other words, it makes behavior a bit more repeatable at no new cost,
so it's a small cleanup. I've been carrying this as a patch for the
last few weeks or so in my tree (done at the same time the sign error
was fixed in commit 12e993b894), I might as well commit it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/putuser.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/getuser.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/lib/checksum_32.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
This can happen if the instruction is much longer than the maximum length,
or if insn->opnd_bytes is manually changed.
This patch also fixes warnings from -Wswitch-default flag.
Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120413032427.32577.42602.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'max' range needs to be unsigned, since the size of the user address
space is bigger than 2GB.
We know that 'count' is positive in 'long' (that is checked in the
caller), so we will truncate 'max' down to something that fits in a
signed long, but before we actually do that, that comparison needs to be
done in unsigned.
Bug introduced in commit 92ae03f2ef ("x86: merge 32/64-bit versions of
'strncpy_from_user()' and speed it up"). On x86-64 you can't trigger
this, since the user address space is much smaller than 63 bits, and on
x86-32 it works in practice, since you would seldom hit the strncpy
limits anyway.
I had actually tested the corner-cases, I had only tested them on
x86-64. Besides, I had only worried about the case of a pointer *close*
to the end of the address space, rather than really far away from it ;)
This also changes the "we hit the user-specified maximum" to return
'res', for the trivial reason that gcc seems to generate better code
that way. 'res' and 'count' are the same in that case, so it really
doesn't matter which one we return.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This merges the 32- and 64-bit versions of the x86 strncpy_from_user()
by just rewriting it in C rather than the ancient inline asm versions
that used lodsb/stosb and had been duplicated for (trivial) differences
between the 32-bit and 64-bit versions.
While doing that, it also speeds them up by doing the accesses a word at
a time. Finally, the new routines also properly handle the case of
hitting the end of the address space, which we have never done correctly
before (fs/namei.c has a hack around it for that reason).
Despite all these improvements, it actually removes more lines than it
adds, due to the de-duplication. Also, we no longer export (or define)
the legacy __strncpy_from_user() function (that was defined to not do
the user permission checks), since it's not actually used anywhere, and
the user address space checks are built in to the new code.
Other architecture maintainers have been notified that the old hack in
fs/namei.c will be going away in the 3.5 merge window, in case they
copied the x86 approach of being a bit cavalier about the end of the
address space.
Cc: linux-arch@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin" <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86/asm changes from Ingo Molnar
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Include probe_roms.h in probe_roms.c
x86/32: Print control and debug registers for kerenel context
x86: Tighten dependencies of CPU_SUP_*_32
x86/numa: Improve internode cache alignment
x86: Fix the NMI nesting comments
x86-64: Improve insn scheduling in SAVE_ARGS_IRQ
x86-64: Fix CFI annotations for NMI nesting code
bitops: Add missing parentheses to new get_order macro
bitops: Optimise get_order()
bitops: Adjust the comment on get_order() to describe the size==0 case
x86/spinlocks: Eliminate TICKET_MASK
x86-64: Handle byte-wise tail copying in memcpy() without a loop
x86-64: Fix memcpy() to support sizes of 4Gb and above
x86-64: Fix memset() to support sizes of 4Gb and above
x86-64: Slightly shorten copy_page()
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Commit f0fbf0abc0 ("x86: integrate delay functions") converted
delay_tsc() into a random delay generator for 64 bit. The reason is
that it merged the mostly identical versions of delay_32.c and
delay_64.c. Though the subtle difference of the result was:
static void delay_tsc(unsigned long loops)
{
- unsigned bclock, now;
+ unsigned long bclock, now;
Now the function uses rdtscl() which returns the lower 32bit of the
TSC. On 32bit that's not problematic as unsigned long is 32bit. On 64
bit this fails when the lower 32bit are close to wrap around when
bclock is read, because the following check
if ((now - bclock) >= loops)
break;
evaluated to true on 64bit for e.g. bclock = 0xffffffff and now = 0
because the unsigned long (now - bclock) of these values results in
0xffffffff00000001 which is definitely larger than the loops
value. That explains Tvortkos observation:
"Because I am seeing udelay(500) (_occasionally_) being short, and
that by delaying for some duration between 0us (yep) and 491us."
Make those variables explicitely u32 again, so this works for both 32
and 64 bit.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # >= 2.6.27
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix to decode grouped AVX with VEX pp bits which should be
handled as same as last-prefixes. This fixes below warnings
in posttest with CONFIG_CRYPTO_SHA1_SSSE3=y.
Warning: arch/x86/tools/test_get_len found difference at <sha1_transform_avx>:ffffffff810d5fc0
Warning: ffffffff810d6069: c5 f9 73 de 04 vpsrldq $0x4,%xmm6,%xmm0
Warning: objdump says 5 bytes, but insn_get_length() says 4
...
With this change, test_get_len can decode it correctly.
$ arch/x86/tools/test_get_len -v -y
ffffffff810d6069: c5 f9 73 de 04 vpsrldq $0x4,%xmm6,%xmm0
Succeed: decoded and checked 1 instructions
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20120210053340.30429.73410.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While hard to measure, reducing the number of possibly/likely
mis-predicted branches can generally be expected to be slightly
better.
Other than apparent at the first glance, this also doesn't grow
the function size (the alignment gap to the next function just
gets smaller).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F218584020000780006F422@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While currently there doesn't appear to be any reachable in-tree
case where such large memory blocks may be passed to memcpy(),
we already had hit the problem in our Xen kernels. Just like
done recently for mmeset(), rather than working around it,
prevent others from falling into the same trap by fixing this
long standing limitation.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F21846F020000780006F3FA@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While currently there doesn't appear to be any reachable in-tree
case where such large memory blocks may be passed to memset()
(alloc_bootmem() being the primary non-reachable one, as it gets
called with suitably large sizes in FLATMEM configurations), we
have recently hit the problem a second time in our Xen kernels.
Rather than working around it a second time, prevent others from
falling into the same trap by fixing this long standing
limitation.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F05D992020000780006AA09@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the "xchg" implementation, %ebx and %ecx don't need to be copied
into %eax and %edx respectively (this is only necessary when desiring
to only read the stored value).
In the "add_unless" implementation, swapping the use of %ecx and %esi
for passing arguments allows %esi to become an input only (i.e.
permitting the register to be re-used to address the same object
without reload).
In "{add,sub}_return", doing the initial read64 through the passed in
%ecx decreases a register dependency.
In "inc_not_zero", a branch can be eliminated by or-ing together the
two halves of the current (64-bit) value, and code size can be further
reduced by adjusting the arithmetic slightly.
v2: Undo the folding of "xchg" and "set".
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F19A2BC020000780006E0DC@nat28.tlf.novell.com
Cc: Luca Barbieri <luca@luca-barbieri.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Eric pointed out overly restrictive constraints in atomic64_set(), but
there are issues throughout the file. In the cited case, %ebx and %ecx
are inputs only (don't get changed by either of the two low level
implementations). This was also the case elsewhere.
Further in many cases early-clobber indicators were missing.
Finally, the previous implementation rolled a custom alternative
instruction macro from scratch, rather than using alternative_call()
(which was introduced with the commit that the description of the
change in question actually refers to). Adjusting has the benefit of
not hiding referenced symbols from the compiler, which however requires
them to be declared not just in the exporting source file (which, as a
desirable side effect, in turn allows that exporting file to become a
real 5-line stub).
This patch does not eliminate the overly restrictive memory clobbers,
however: Doing so would occasionally make the compiler set up a second
register for accessing the memory object (to satisfy the added "m"
constraint), and it's not clear which of the two non-optimal
alternatives is better.
v2: Re-do the declaration and exporting of the internal symbols.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F19A2A5020000780006E0D9@nat28.tlf.novell.com
Cc: Luca Barbieri <luca@luca-barbieri.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/accounting, proc: Fix /proc/stat interrupts sum
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
x86/kprobes: Add arch/x86/tools/insn_sanity to .gitignore
x86/kprobes: Fix typo transferred from Intel manual
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
x86, opcode: ANDN and Group 17 in x86-opcode-map.txt
x86/kconfig: Move the ZONE_DMA entry under a menu
x86/UV2: Add accounting for BAU strong nacks
x86/UV2: Ack BAU interrupt earlier
x86/UV2: Remove stale no-resources test for UV2 BAU
x86/UV2: Work around BAU bug
x86/UV2: Fix BAU destination timeout initialization
x86/UV2: Fix new UV2 hardware by using native UV2 broadcast mode
x86: Get rid of dubious one-bit signed bitfield
The Intel documentation at
http://software.intel.com/file/36945
shows the ANDN opcode and Group 17 with encoding f2 and f3 encoding
respectively. The current version of x86-opcode-map.txt shows them
with f3 and f4. Unless someone can point to documentation which shows
the currently used encoding the following patch be applied.
Signed-off-by: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/CAOPLpQdq5SuVo9=023CYhbFLAX9rONyjmYq7jJkqc5xwctW5eA@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The arch/x86/lib/x86-opcode-map.txt file [used by the
kprobes instruction decoder] contains the line:
af: SCAS/W/D/Q rAX,Xv
This is what the Intel manuals show, but it's not correct.
The 'X' stands for:
Memory addressed by the DS:rSI register pair (for example, MOVS, CMPS, OUTS, or LODS).
On the other hand 'Y' means (also see the ae byte entry for
SCASB):
Memory addressed by the ES:rDI register pair (for example, MOVS, CMPS, INS, STOS, or SCAS).
Signed-off-by: Ulrich Drepper <drepper@gmail.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/CAOPLpQfytPyDEBF1Hbkpo7ovUerEsstVGxBr%3DEpDL-BKEMaqLA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86: Fix atomic64_xxx_cx8() functions
x86: Fix and improve cmpxchg_double{,_local}()
x86_64, asm: Optimise fls(), ffs() and fls64()
x86, bitops: Move fls64.h inside __KERNEL__
x86: Fix and improve percpu_cmpxchg{8,16}b_double()
x86: Report cpb and eff_freq_ro flags correctly
x86/i386: Use less assembly in strlen(), speed things up a bit
x86: Use the same node_distance for 32 and 64-bit
x86: Fix rflags in FAKE_STACK_FRAME
x86: Clean up and extend do_int3()
x86: Call do_notify_resume() with interrupts enabled
x86/div64: Add a micro-optimization shortcut if base is power of two
x86-64: Cleanup some assembly entry points
x86-64: Slightly shorten line system call entry and exit paths
x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
x86-64: Slightly shorten int_ret_from_sys_call
x86, efi: Convert efi_phys_get_time() args to physical addresses
x86: Default to vsyscall=emulate
x86-64: Set siginfo and context on vsyscall emulation faults
x86: consolidate xchg and xadd macros
...
%r13 got saved and restored without ever getting touched, so
there's no need to do so.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F05D9F9020000780006AA0D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current i386 strlen() hardcodes NOT/DEC sequence. DEC is
mentioned to be suboptimal on Core2. So, put only REPNE SCASB
sequence in assembly, compiler can do the rest.
The difference in generated code is like below (MCORE2=y):
<strlen>:
push %edi
mov $0xffffffff,%ecx
mov %eax,%edi
xor %eax,%eax
repnz scas %es:(%edi),%al
not %ecx
- dec %ecx
- mov %ecx,%eax
+ lea -0x1(%ecx),%eax
pop %edi
ret
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Beulich <JBeulich@suse.com>
Link: http://lkml.kernel.org/r/20111211181319.GA17097@p183.telecom.by
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since new Intel software developers manual introduces
new format for AVX instruction set (including AVX2),
it is important to update x86-opcode-map.txt to fit
those changes.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120557.15475.13236.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For reducing memory usage of attribute table, x86 instruction
decoder puts "Group" attribute only on "no-last-prefix"
attribute table (same as vex_p == 0 case).
Thus, the decoder should look no-last-prefix table first, and
then only if it is not a group, move on to "with-last-prefix"
table (vex_p != 0).
However, current implementation, inat_get_avx_attribute()
looks with-last-prefix directly. So, when decoding
a grouped AVX instruction, the decoder fails to find correct
group because there is no "Group" attribute on the table.
This ends up with the mis-decoding of instructions, as Ingo
reported in http://thread.gmane.org/gmane.linux.kernel/1214103
This patch fixes it to check no-last-prefix table first
even if that is an AVX instruction, and get an attribute from
"with last-prefix" table only if that is not a group.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120539.15475.91428.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix x86 insn decoder for hardening against invalid length
instructions. This adds length checkings for each byte-read
site and if it exceeds MAX_INSN_SIZE, returns immediately.
This can happen when decoding user-space binary.
Caller can check whether it happened by checking insn.*.got
member is set or not.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: acme@redhat.com
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: ravitillo@lbl.gov
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20111007133155.10933.58577.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, vdso: Do not allocate memory for the vDSO
clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option
x86, vdso: Drop now wrong comment
Document the vDSO and add a reference parser
ia64: Replace clocksource.fsys_mmio with generic arch data
x86-64: Move vread_tsc and vread_hpet into the vDSO
clocksource: Replace vread with generic arch data
x86-64: Add --no-undefined to vDSO build
x86-64: Allow alternative patching in the vDSO
x86: Make alternative instruction pointers relative
x86-64: Improve vsyscall emulation CS and RIP handling
x86-64: Emulate legacy vsyscalls
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Map the HPET NX
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Give vvars their own page
x86-64: Document some of entry_64.S
x86-64: Fix alignment of jiffies variable
copy_from_user_nmi() is used in oprofile and perf. Moving it to other
library functions like copy_from_user(). As this is x86 code for 32
and 64 bits, create a new file usercopy.c for unified code.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110607172413.GJ20052@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the write lock path simply subtracting RW_LOCK_BIAS there
is, on large systems, the theoretical possibility of overflowing
the 32-bit value that was used so far (namely if 128 or more
CPUs manage to do the subtraction, but don't get to do the
inverse addition in the failure path quickly enough).
A first measure is to modify RW_LOCK_BIAS itself - with the new
value chosen, it is good for up to 2048 CPUs each allowed to
nest over 2048 times on the read path without causing an issue.
Quite possibly it would even be sufficient to adjust the bias a
little further, assuming that allowing for significantly less
nesting would suffice.
However, as the original value chosen allowed for even more
nesting levels, to support more than 2048 CPUs (possible
currently only for 64-bit kernels) the lock itself gets widened
to 64 bits.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4E258E0D020000780004E3F0@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than having two functionally identical implementations
for 32- and 64-bit configurations, use the previously extended
assembly abstractions to fold the rwsem two implementations into
a shared one.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4E258DF3020000780004E3ED@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than having two functionally identical implementations
for 32- and 64-bit configurations, extend the existing assembly
abstractions enough to fold the two rwlock implementations into
a shared one.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4E258DD7020000780004E3EA@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Drop thunk_ra macro in favor of an additional argument to the thunk
macro since their bodies are almost identical. Do a whitespace scrubbing
and use CFI-aware macros for full annotation.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1306873314-32523-5-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As reported in BZ #30352:
https://bugzilla.kernel.org/show_bug.cgi?id=30352
there's a kernel bug related to reading the last allowed page on x86_64.
The _copy_to_user() and _copy_from_user() functions use the following
check for address limit:
if (buf + size >= limit)
fail();
while it should be more permissive:
if (buf + size > limit)
fail();
That's because the size represents the number of bytes being
read/write from/to buf address AND including the buf address.
So the copy function will actually never touch the limit
address even if "buf + size == limit".
Following program fails to use the last page as buffer
due to the wrong limit check:
#include <sys/mman.h>
#include <sys/socket.h>
#include <assert.h>
#define PAGE_SIZE (4096)
#define LAST_PAGE ((void*)(0x7fffffffe000))
int main()
{
int fds[2], err;
void * ptr = mmap(LAST_PAGE, PAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
assert(ptr == LAST_PAGE);
err = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds);
assert(err == 0);
err = send(fds[0], ptr, PAGE_SIZE, 0);
perror("send");
assert(err == PAGE_SIZE);
err = recv(fds[1], ptr, PAGE_SIZE, MSG_WAITALL);
perror("recv");
assert(err == PAGE_SIZE);
return 0;
}
The other place checking the addr limit is the access_ok() function,
which is working properly. There's just a misleading comment
for the __range_not_ok() macro - which this patch fixes as well.
The last page of the user-space address range is a guard page and
Brian Gerst observed that the guard page itself due to an erratum on K8 cpus
(#121 Sequential Execution Across Non-Canonical Boundary Causes Processor
Hang).
However, the test code is using the last valid page before the guard page.
The bug is that the last byte before the guard page can't be read
because of the off-by-one error. The guard page is left in place.
This bug would normally not show up because the last page is
part of the process stack and never accessed via syscalls.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1305210630-7136-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Support memset() with enhanced rep stosb. On processors supporting enhanced
REP MOVSB/STOSB, the alternative memset_c_e function using enhanced rep stosb
overrides the fast string alternative memset_c and the original function.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-10-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Support memmove() by enhanced rep movsb. On processors supporting enhanced
REP MOVSB/STOSB, the alternative memmove() function using enhanced rep movsb
overrides the original function.
The patch doesn't change the backward memmove case to use enhanced rep
movsb.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-9-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Support memcpy() with enhanced rep movsb. On processors supporting enhanced
rep movsb, the alternative memcpy() function using enhanced rep movsb overrides the original function and the fast string
function.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-8-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Support copy_to_user/copy_from_user() by enhanced REP MOVSB/STOSB.
On processors supporting enhanced REP MOVSB/STOSB, the alternative
copy_user_enhanced_fast_string function using enhanced rep movsb overrides the
original function and the fast string function.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-7-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Intel processors are adding enhancements to REP MOVSB/STOSB and the use of
REP MOVSB/STOSB for optimal memcpy/memset or similar functions is recommended.
Enhancement availability is indicated by CPUID.7.0.EBX[9] (Enhanced REP MOVSB/
STOSB).
Support clear_page() with rep stosb for processor supporting enhanced REP MOVSB
/STOSB. On processors supporting enhanced REP MOVSB/STOSB, the alternative
clear_page_c_e function using enhanced REP STOSB overrides the original function
and the fast string function.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-6-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Omit the segment prefix in the UP case. GS is not used then
and we will generate segfaults if cmpxchg16b is used otherwise.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Flush TLB if PGD entry is changed in i386 PAE mode
x86, dumpstack: Correct stack dump info when frame pointer is available
x86: Clean up csum-copy_64.S a bit
x86: Fix common misspellings
x86: Fix misspelling and align params
x86: Use PentiumPro-optimized partial_csum() on VIA C7
The many stray whitespaces and other uncleanlinesses made this code
almost unreadable to me - so fix those.
No changes to the code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They were generated by 'codespell' and then manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
percpu: Generic support for this_cpu_cmpxchg_double()
alpha: use L1_CACHE_BYTES for cacheline size in the linker script
percpu: align percpu readmostly subsection to cacheline
Fix up trivial conflict in arch/x86/kernel/vmlinux.lds.S due to the
percpu alignment having changed ("x86: Reduce back the alignment of the
per-CPU data section")
* 'x86-mem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, mem: Convert memmove() to assembly file and fix return value bug
'simple' would have required specifying current frame address
and return address location manually, but that's obviously not
the case (and not necessary) here.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6D1082020000780003454C@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some of the items removed were apparently never used, others
simply didn't get removed with their last user.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4D6BD3A002000078000341F1@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cleaning up and shortening code...
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BD35002000078000341DA@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These weren't part of the initial commit of this code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BCDFF02000078000341B0@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Support this_cpu_cmpxchg_double() using the cmpxchg16b and cmpxchg8b
instructions.
-tj: s/percpu_cmpxchg16b/percpu_cmpxchg16b_double/ for consistency and
other cosmetic changes.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>