We used to delay switching to the new credentials until after we had
mapped the executable (and possible elf interpreter). That was kind of
odd to begin with, since the new executable will actually then _run_
with the new creds, but whatever.
The bigger problem was that we also want to make sure that we turn off
prof events and tracing before we start mapping the new executable
state. So while this is a cleanup, it's also a fix for a possible
information leak.
Reported-by: Robert Święcki <robert@swiecki.net>
Tested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A double-bug exists in the bss calculation code, where an overflow can
happen in the "last_bss - elf_bss" calculation, but vm_brk internally
aligns the argument, underflowing it, wrapping back around safe. We
shouldn't depend on these bugs staying in sync, so this cleans up the
bss padding handling to avoid the overflow.
This moves the bss padzero() before the last_bss > elf_bss case, since
the zero-filling of the ELF_PAGE should have nothing to do with the
relationship of last_bss and elf_bss: any trailing portion should be
zeroed, and a zero size is already handled by padzero().
Then it handles the math on elf_bss vs last_bss correctly. These need
to both be ELF_PAGE aligned to get the comparison correct, since that's
the expected granularity of the mappings. Since elf_bss already had
alignment-based padding happen in padzero(), the "start" of the new
vm_brk() should be moved forward as done in the original code. However,
since the "end" of the vm_brk() area will already become PAGE_ALIGNed in
vm_brk() then last_bss should get aligned here to avoid hiding it as a
side-effect.
Additionally makes a cosmetic change to the initial last_bss calculation
so it's easier to read in comparison to the load_addr calculation above
it (i.e. the only difference is p_filesz vs p_memsz).
Link: http://lkml.kernel.org/r/1468014494-25291-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Ismael Ripoll Ripoll <iripoll@upv.es>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The offset in the core file used to be tracked with ->written field of
the coredump_params structure. The field was retired in favour of
file->f_pos.
However, ->f_pos is not maintained for pipes which leads to breakage.
Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.
Fixes: a008393951 ("get rid of coredump_params->written").
Reported-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.
However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.
What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.
So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_elf_library doesn't handle vm_brk failure although nothing really
indicates it cannot do that because the function is allowed to fail due
to vm_mmap failures already. This might be not a problem now but later
patch will make vm_brk killable (resp. mmap_sem for write waiting will
become killable) and so the failure will be more probable.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs cleanups from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coredump: only charge written data against RLIMIT_CORE
coredump: get rid of coredump_params->written
ecryptfs_lookup(): try either only encrypted or plaintext name
ecryptfs: avoid multiple aliases for directories
bpf: reject invalid names right in ->lookup()
__d_alloc(): treat NULL name as QSTR("/", 1)
mtd: switch ubi_open_volume_path() to vfs_stat()
mtd: switch open_mtd_by_chdev() to use of vfs_stat()
cprm->written is redundant with cprm->file->f_pos, so use that instead.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also pass any interpreter's file header to `arch_check_elf' so that any
architecture handler can have a look at it if needed.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Fortune <Matthew.Fortune@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/11478/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Pull vfs update from Al Viro:
- misc stable fixes
- trivial kernel-doc and comment fixups
- remove never-used block_page_mkwrite() wrapper function, and rename
the function that is _actually_ used to not have double underscores.
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: 9p: cache.h: Add #define of include guard
vfs: remove stale comment in inode_operations
vfs: remove unused wrapper block_page_mkwrite()
binfmt_elf: Correct `arch_check_elf's description
fs: fix writeback.c kernel-doc warnings
fs: fix inode.c kernel-doc warning
fs/pipe.c: return error code rather than 0 in pipe_write()
fs/pipe.c: preserve alloc_file() error code
binfmt_elf: Don't clobber passed executable's file header
FS-Cache: Handle a write to the page immediately beyond the EOF marker
cachefiles: perform test on s_blocksize when opening cache file.
FS-Cache: Don't override netfs's primary_index if registering failed
FS-Cache: Increase reference of parent after registering, netfs success
debugfs: fix refcount imbalance in start_creating
Correct `arch_check_elf's description, mistakenly copied and pasted from
`arch_elf_pt_proc'.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Do not clobber the buffer space passed from `search_binary_handler' and
originally preloaded by `prepare_binprm' with the executable's file
header by overwriting it with its interpreter's file header. Instead
keep the buffer space intact and directly use the data structure locally
allocated for the interpreter's file header, fixing a bug introduced in
2.1.14 with loadable module support (linux-mips.org commit beb11695
[Import of Linux/MIPS 2.1.14], predating kernel.org repo's history).
Adjust the amount of data read from the interpreter's file accordingly.
This was not an issue before loadable module support, because back then
`load_elf_binary' was executed only once for a given ELF executable,
whether the function succeeded or failed.
With loadable module support supported and enabled, upon a failure of
`load_elf_binary' -- which may for example be caused by architecture
code rejecting an executable due to a missing hardware feature requested
in the file header -- a module load is attempted and then the function
reexecuted by `search_binary_handler'. With the executable's file
header replaced with its interpreter's file header the executable can
then be erroneously accepted in this subsequent attempt.
Cc: stable@vger.kernel.org # all the way back
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings. This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.
Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags. Also update the documented default value of
coredump_filter to be consistent with the core(5) man page. The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config. This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
load_elf_binary() returns `retval', not `error'.
Fixes: a87938b2e2 ("fs/binfmt_elf.c: fix bug in loading of PIE binaries")
Reported-by: James Hogan <james.hogan@imgtec.com>
Cc: Michael Davidson <md@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The arch_randomize_brk() function is used on several architectures,
even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
the architectures that support it, while still handling CONFIG_COMPAT_BRK.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Mller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
powerpc, and x86. The problem is that if there is a leak of ASLR from
the executable (ET_DYN), it means a leak of shared library offset as
well (mmap), and vice versa. Further details and a PoC of this attack
is available here:
http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html
With this patch, a PIE linked executable (ET_DYN) has its own ASLR
region:
$ ./show_mmaps_pie
54859ccd6000-54859ccd7000 r-xp ... /tmp/show_mmaps_pie
54859ced6000-54859ced7000 r--p ... /tmp/show_mmaps_pie
54859ced7000-54859ced8000 rw-p ... /tmp/show_mmaps_pie
7f75be764000-7f75be91f000 r-xp ... /lib/x86_64-linux-gnu/libc.so.6
7f75be91f000-7f75beb1f000 ---p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb1f000-7f75beb23000 r--p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb23000-7f75beb25000 rw-p ... /lib/x86_64-linux-gnu/libc.so.6
7f75beb25000-7f75beb2a000 rw-p ...
7f75beb2a000-7f75beb4d000 r-xp ... /lib64/ld-linux-x86-64.so.2
7f75bed45000-7f75bed46000 rw-p ...
7f75bed46000-7f75bed47000 r-xp ...
7f75bed47000-7f75bed4c000 rw-p ...
7f75bed4c000-7f75bed4d000 r--p ... /lib64/ld-linux-x86-64.so.2
7f75bed4d000-7f75bed4e000 rw-p ... /lib64/ld-linux-x86-64.so.2
7f75bed4e000-7f75bed4f000 rw-p ...
7fffb3741000-7fffb3762000 rw-p ... [stack]
7fffb377b000-7fffb377d000 r--p ... [vvar]
7fffb377d000-7fffb377f000 r-xp ... [vdso]
The change is to add a call the newly created arch_mmap_rnd() into the
ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR,
as was already done on s390. Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE,
which is no longer needed.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Mller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
address allocation strategy, load_elf_binary() will attempt to map a PIE
binary into an address range immediately below mm->mmap_base.
Unfortunately, load_elf_ binary() does not take account of the need to
allocate sufficient space for the entire binary which means that, while
the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are
that is supposed to be the "gap" between the stack and the binary.
Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this
means that binaries with large data segments > 128MB can end up mapping
part of their data segment over their stack resulting in corruption of the
stack (and the data segment once the binary starts to run).
Any PIE binary with a data segment > 128MB is vulnerable to this although
address randomization means that the actual gap between the stack and the
end of the binary is normally greater than 128MB. The larger the data
segment of the binary the higher the probability of failure.
Fix this by calculating the total size of the binary in the same way as
load_elf_interp().
Signed-off-by: Michael Davidson <md@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.
The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":
static unsigned long randomize_stack_top(unsigned long stack_top)
{
unsigned int random_variable = 0;
if ((current->flags & PF_RANDOMIZE) &&
!(current->personality & ADDR_NO_RANDOMIZE)) {
random_variable = get_random_int() & STACK_RND_MASK;
random_variable <<= PAGE_SHIFT;
}
return PAGE_ALIGN(stack_top) + random_variable;
return PAGE_ALIGN(stack_top) - random_variable;
}
Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
random_variable <<= PAGE_SHIFT;
then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.
These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).
This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().
The successful fix can be tested with:
$ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0 [stack]
7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0 [stack]
7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0 [stack]
7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0 [stack]
...
Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.
Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull MIPS updates from Ralf Baechle:
"This is an unusually large pull request for MIPS - in parts because
lots of patches missed the 3.18 deadline but primarily because some
folks opened the flood gates.
- Retire the MIPS-specific phys_t with the generic phys_addr_t.
- Improvments for the backtrace code used by oprofile.
- Better backtraces on SMP systems.
- Cleanups for the Octeon platform code.
- Cleanups and fixes for the Loongson platform code.
- Cleanups and fixes to the firmware library.
- Switch ATH79 platform to use the firmware library.
- Grand overhault to the SEAD3 and Malta interrupt code.
- Move the GIC interrupt code to drivers/irqchip
- Lots of GIC cleanups and updates to the GIC code to use modern IRQ
infrastructures and features of the kernel.
- OF documentation updates for the GIC bindings
- Move GIC clocksource driver to drivers/clocksource
- Merge GIC clocksource driver with clockevent driver.
- Further updates to bring the GIC clocksource driver up to date.
- R3000 TLB code cleanups
- Improvments to the Loongson 3 platform code.
- Convert pr_warning to pr_warn.
- Merge a bunch of small lantiq and ralink fixes that have been
staged/lingering inside the openwrt tree for a while.
- Update archhelp for IP22/IP32
- Fix a number of issues for Loongson 1B.
- New clocksource and clockevent driver for Loongson 1B.
- Further work on clk handling for Loongson 1B.
- Platform work for Broadcom BMIPS.
- Error handling cleanups for TurboChannel.
- Fixes and optimization to the microMIPS support.
- Option to disable the FTLB.
- Dump more relevant information on machine check exception
- Change binfmt to allow arch to examine PT_*PROC headers
- Support for new style FPU register model in O32
- VDSO randomization.
- BCM47xx cleanups
- BCM47xx reimplement the way the kernel accesses NVRAM information.
- Random cleanups
- Add support for ATH25 platforms
- Remove pointless locking code in some PCI platforms.
- Some improvments to EVA support
- Minor Alchemy cleanup"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (185 commits)
MIPS: Add MFHC0 and MTHC0 instructions to uasm.
MIPS: Cosmetic cleanups of page table headers.
MIPS: Add CP0 macros for extended EntryLo registers
MIPS: Remove now unused definition of phys_t.
MIPS: Replace use of phys_t with phys_addr_t.
MIPS: Replace MIPS-specific 64BIT_PHYS_ADDR with generic PHYS_ADDR_T_64BIT
PCMCIA: Alchemy Don't select 64BIT_PHYS_ADDR in Kconfig.
MIPS: lib: memset: Clean up some MIPS{EL,EB} ifdefery
MIPS: iomap: Use __mem_{read,write}{b,w,l} for MMIO
MIPS: <asm/types.h> fix indentation.
MAINTAINERS: Add entry for BMIPS multiplatform kernel
MIPS: Enable VDSO randomization
MIPS: Remove a temporary hack for debugging cache flushes in SMTC configuration
MIPS: Remove declaration of obsolete arch_init_clk_ops()
MIPS: atomic.h: Reformat to fit in 79 columns
MIPS: Apply `.insn' to fixup labels throughout
MIPS: Fix microMIPS LL/SC immediate offsets
MIPS: Kconfig: Only allow 32-bit microMIPS builds
MIPS: signal.c: Fix an invalid cast in ISA mode bit handling
MIPS: mm: Only build one microassembler that is suitable
...
vma_dump_size() has been used several times on actual dumper and it is
supposed to return the same value for the same vma. But vma_dump_size()
could return different values for same vma.
The known problem case is concurrent shared memory removal. If a vma is
used for a shared memory and that shared memory is removed between
writing program header and dumping vma memory, this will result in a
dump file which is internally consistent.
To fix the problem, we set baseline to get dump size and store the size
into vma_filesz and always use the same vma dump size which is stored in
vma_filsz. The consistnecy with reality is not actually guranteed, but
it's tolerable since that is fully consistent with base line.
Signed-off-by: Jungseung Lee <js07.lee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MIPS is introducing new variants of its O32 ABI which differ in their
handling of floating point, in order to enable a gradual transition
towards a world where mips32 binaries can take advantage of new hardware
features only available when configured for certain FP modes. In order
to do this ELF binaries are being augmented with a new section that
indicates, amongst other things, the FP mode requirements of the binary.
The presence & location of such a section is indicated by a program
header in the PT_LOPROC ... PT_HIPROC range.
In order to allow the MIPS architecture code to examine the program
header & section in question, pass all program headers in this range
to an architecture-specific arch_elf_pt_proc function. This function
may return an error if the header is deemed invalid or unsuitable for
the system, in which case that error will be returned from
load_elf_binary and upwards through the execve syscall.
A means is required for the architecture code to make a decision once
it is known that all such headers have been seen, but before it is too
late to return from an execve syscall. For this purpose the
arch_check_elf function is added, and called once, after all PT_LOPROC
to PT_HIPROC headers have been passed to arch_elf_pt_proc but before
the code which invoked execve has been lost. This enables the
architecture code to make a decision based upon all the headers present
in an ELF binary and its interpreter, as is required to forbid
conflicting FP ABI requirements between an ELF & its interpreter.
In order to allow data to be stored throughout the calls to the above
functions, struct arch_elf_state is introduced.
Finally a variant of the SET_PERSONALITY macro is introduced which
accepts a pointer to the struct arch_elf_state, allowing it to act
based upon state observed from the architecture specific program
headers.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7679/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Load the program headers of an ELF interpreter early enough in
load_elf_binary that they can be examined before it's too late to return
an error from an exec syscall. This patch does not perform any such
checking, it merely lays the groundwork for a further patch to do so.
No functional change is intended.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7675/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
load_elf_binary & load_elf_interp both load program headers from an ELF
executable in the same way, duplicating the code. This patch introduces
a helper function (load_elf_phdrs) which performs this common task &
calls it from both load_elf_binary & load_elf_interp. In addition to
reducing code duplication, this is part of preparing to load the ELF
interpreter headers earlier such that they can be examined before it's
too late to return an error from an exec syscall.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7676/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
... rather than doing that in the guts of ->load_binary().
[updated to fix the bug spotted by Shentino - for SIGSEGV we really need
something stronger than send_sig_info(); again, better do that in one place]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull x86 cdso updates from Peter Anvin:
"Vdso cleanups and improvements largely from Andy Lutomirski. This
makes the vdso a lot less ''special''"
* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso, build: Make LE access macros clearer, host-safe
x86/vdso, build: Fix cross-compilation from big-endian architectures
x86/vdso, build: When vdso2c fails, unlink the output
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
x86, mm: Improve _install_special_mapping and fix x86 vdso naming
mm, fs: Add vm_ops->name as an alternative to arch_vma_name
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
x86, vdso: Move the 32-bit vdso special pages after the text
x86, vdso: Reimplement vdso.so preparation in build-time C
x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
x86, vdso: Clean up 32-bit vs 64-bit vdso params
x86, mm: Ensure correct alignment of the fixmap
arch_vma_name sucks. It's a silly hack, and it's annoying to
implement correctly. In fact, AFAICS, even the straightforward x86
implementation is incorrect (I suspect that it breaks if the vdso
mapping is split or gets remapped).
This adds a new vm_ops->name operation that can replace it. The
followup patches will remove all uses of arch_vma_name on x86,
fixing a couple of annoyances in the process.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
is always zero. Not only this looks strange, this is unnecessary
because mm_init() has already set ->def_flags = 0.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uselib hasn't been used since libc5; glibc does not use it. Support
turning it off.
When disabled, also omit the load_elf_library implementation from
binfmt_elf.c, which only uselib invokes.
bloat-o-meter:
add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785)
function old new delta
padzero 39 36 -3
uselib_flags 20 - -20
sys_uselib 168 - -168
SyS_uselib 168 - -168
load_elf_library 426 - -426
The new CONFIG_USELIB defaults to `y'.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two defines are unused since the removal of the a.out interpreter
support in the ELF loader in kernel 2.6.25
Signed-off-by: Todor Minchev <todor@minchev.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_write() analog, takes core_dump_params instead of file,
keeps track of the amount written in cprm->written and checks for
cprm->limit. Start using it in binfmt_elf.c...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A high setting of max_map_count, and a process core-dumping with a large
enough vm_map_count could result in an NT_FILE note not being written,
and the kernel crashing immediately later because it has assumed
otherwise.
Reproduction of the oops-causing bug described here:
https://lkml.org/lkml/2013/8/30/50
Rge ussue originated in commit 2aa362c49c ("coredump: extend core dump
note section to contain file names of mapped file") from Oct 4, 2012.
This patch make that section optional in that case. fill_files_note()
should signify the error, and also let the info struct in
elf_core_dump() be zero-initialized so that we can check for the
optionally written note.
[akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Dan Aloni <alonid@stratoscale.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Reported-by: Martin MOKREJS <mmokrejs@gmail.com>
Tested-by: Martin MOKREJS <mmokrejs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
- A pile of addition POWER8 bits and nits, such as updated
performance counter support (Michael Ellerman), new branch history
buffer support (Anshuman Khandual), base support for the new PCI
host bridge when not using the hypervisor (Gavin Shan) and other
random related bits and fixes from various contributors.
- Some rework of our page table format by Aneesh Kumar which fixes a
thing or two and paves the way for THP support. THP itself will
not make it this time around however.
- More Freescale updates, including Altivec support on the new e6500
cores, new PCI controller support, and a pile of new boards support
and updates.
- The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
powerpc: Fix build error for book3e
powerpc: Context switch the new EBB SPRs
powerpc: Turn on the EBB H/FSCR bits
powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
powerpc: Setup BHRB instructions facility in HFSCR for POWER8
powerpc: Fix interrupt range check on debug exception
powerpc: Update tlbie/tlbiel as per ISA doc
powerpc: Print page size info during boot
powerpc: print both base and actual page size on hash failure
powerpc: Fix hpte_decode to use the correct decoding for page sizes
powerpc: Decode the pte-lp-encoding bits correctly.
powerpc: Use encode avpn where we need only avpn values
powerpc: Reduce PTE table memory wastage
powerpc: Move the pte free routines from common header
powerpc: Reduce the PTE_INDEX_SIZE
powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
powerpc: New hugepage directory format
powerpc: Don't truncate pgd_index wrongly
powerpc: Don't hard code the size of pte page
powerpc: Save DAR and DSISR in pt_regs on MCE
...
The comment I originally added in commit a3defbe5c3 ("binfmt_elf: fix
PIE execution with randomization disabled") is not really 100% accurate
-- sysctl is not the only way how PF_RANDOMIZE could be forcibly unset
in runtime.
Another option of course is direct modification of personality flags
(i.e. running through setarch wrapper).
Make the comment more explicit and accurate.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are currently out of free bits in AT_HWCAP. With POWER8, we have
several hardware features that we need to advertise.
Tested on POWER and x86.
Signed-off-by: Michael Neuling <michael@neuling.org>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Documentation/filesystems/proc.txt says about coredump_filter bitmask,
Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
effected by bit 5-6.
However current code can go into the subsequent flag checks of bit 0-4
for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
as written in the document.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()
Signed-off-by: James Hogan <james.hogan@imgtec.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRMmVXAAoJEKHZs+irPybfivgP/inEXqJyfw59omQdjwvYcU/a
/u0MJ3UKSNS3U+HknfaFCy/Nwk1dqPLjqqyVC1V6AbUPBXlaEwGcimlNRx2uRjdq
Uh96upMLHsNuF/xiiR477g3RwY0egIJdM1R1bGi3mZ3vVrNQGF+wbni6f61xCWGz
M/4rDglpQvE79oLhYdgj6tidZtHQT0YWtERA9W90zkQWXGYmpFPKBKbfZAi5+rKQ
U6Gpg26orUugzXNaxltJEYKE8gjLTppEabx8DARnItZ4zCMy4dw5RBJ35RFvQw6e
eSmfgTy9w9WqBMY2+QMSgU0KQt1IITCzX7OlOXC0jALQJXoU0WWbOELlBVQLCwF1
T0OcR/5ZP/hIlOk5Dh+e9U3AtbASXdMtqA0ZUe78woH1CBf7Nc/0c0vRg23EdMh8
lnHDJxT/UqskoOYLI4kgWbEdLDy4uTh19U2pVi7VCo7ksLB9Bj9Xc8VSKgscSXTl
OwzN+c4Jgtu8FDFTp+Af4AT8pYGJ08j8L2ErsV2sOv3Q44U5WXdrMz3GSgwXj8+4
wZk3HvdkQVkMD5sJCUZgAswaN6BnbB0pHdCz4wMQ8jR/Ogs015Ipk64Ecym9S/4n
uES7PnDtt/4lb5EyX2ScbvdnZTAFTaaP7OOhC77BOQvbQjIW1tkAcxWJqRry86uS
iM0BFgK6Ohx3geqa5Ft0
=65cR
-----END PGP SIGNATURE-----
Merge tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag
Pull new ImgTec Meta architecture from James Hogan:
"This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of
metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()"
* tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: (61 commits)
metag: Provide dma_get_sgtable()
metag: prom.h: remove declaration of metag_dt_memblock_reserve()
metag: copy devicetree to non-init memory
metag: cleanup metag_ksyms.c includes
metag: move mm/init.c exports out of metag_ksyms.c
metag: move usercopy.c exports out of metag_ksyms.c
metag: move setup.c exports out of metag_ksyms.c
metag: move kick.c exports out of metag_ksyms.c
metag: move traps.c exports out of metag_ksyms.c
metag: move irq enable out of irqflags.h on SMP
genksyms: fix metag symbol prefix on crc symbols
metag: hugetlb: convert to vm_unmapped_area()
metag: export clear_page and copy_page
metag: export metag_code_cache_flush_all
metag: protect more non-MMU memory regions
metag: make TXPRIVEXT bits explicit
metag: kernel/setup.c: sort includes
perf: Enable building perf tools for Meta
metag: add boot time LNKGET/LNKSET check
metag: add __init to metag_cache_probe()
...
The commit "binfmt_elf: cleanups"
(f670d0ecda) removed an ifndef elf_map but
this breaks compilation for metag which does define elf_map.
This adds the ifndef back in as it was before, but does not affect the
other cleanups made by that patch.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Mikael Pettersson <mikpe@it.uu.se>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
In fill_elf_header(), elf->e_ident[EI_OSABI] is always set to ELF_OSABI,
so remove the unused argument 'osabi'.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
If elf_core_dump() is called and fill_note_info() fails in the kmalloc()
then it returns 0 but has not yet initialised all the needed fields. As a
result we do a kfree(randomness) after correctly skipping the thread data.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull generic execve() changes from Al Viro:
"This introduces the generic kernel_thread() and kernel_execve()
functions, and switches x86, arm, alpha, um and s390 over to them."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
s390: convert to generic kernel_execve()
s390: switch to generic kernel_thread()
s390: fold kernel_thread_helper() into ret_from_fork()
s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
um: switch to generic kernel_thread()
x86, um/x86: switch to generic sys_execve and kernel_execve
x86: split ret_from_fork
alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
alpha: switch to generic kernel_thread()
alpha: switch to generic sys_execve()
arm: get rid of execve wrapper, switch to generic execve() implementation
arm: optimized current_pt_regs()
arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
generic sys_execve()
generic kernel_execve()
new helper: current_pt_regs()
preparation for generic kernel_thread()
um: kill thread->forking
um: let signal_delivered() do SIGTRAP on singlestepping into handler
...
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:
| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
This patch removes reserved_vm counter from mm_struct. Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.
Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags:
VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for
sys_madvise. The next patch will use it for replacing the outdated flag
VM_RESERVED.
Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
(VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This note has the following format:
long count -- how many files are mapped
long page_size -- units for file_ofs
array of [COUNT] elements of
long start
long end
long file_ofs
followed by COUNT filenames in ASCII: "FILE1" NUL "FILE2" NUL...
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Existing PRSTATUS note contains only si_signo, si_code, si_errno fields
from the siginfo of the signal which caused core to be dumped.
There are tools which try to analyze crashes for possible security
implications, and they want to use, among other data, si_addr field from
the SIGSEGV.
This patch adds a new elf note, NT_SIGINFO, which contains the complete
siginfo_t of the signal which killed the process.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparatory patch for the introduction of NT_SIGINFO elf note.
With this patch we pass "siginfo_t *siginfo" instead of "int signr" to
do_coredump() and put it into coredump_params. It will be used by the
next patch. Most changes are simple s/signr/siginfo->si_signo/.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_elf_interp() has interp_map_addr carefully described as
"uninitialized_var" and marked so as to avoid a warning. However if you
trace the code it is passed into load_elf_interp and then this value is
checked against NULL.
As this return value isn't used this is actually safe but it freaks
various analysis tools that see un-initialized memory addresses being read
before their value is ever defined.
Set it to NULL as a matter of programming good taste if nothing else
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In !CORE_DUMP_USE_REGSET case, if elf_note_info_init fails to allocate
memory for info->fields, it frees already allocated stuff and returns
error to its caller, fill_note_info. Which in turn returns error to its
caller, elf_core_dump. Which jumps to cleanup label and calls
free_note_info, which will happily try to free all info->fields again.
BOOM.
This is the fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Venu Byravarasu <vbyravarasu@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.
Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.
- Use of the new kuid_t type ensures the if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.
- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.
- With the user namespaces compiled out the performance is as good or
better than it is today.
- For most operations absolutely nothing changes performance or
operationally with the user namespace enabled.
- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).
- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these value specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.
- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.
- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.
- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.
My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."
Fix up trivial conflicts due to nearby independent changes in fs/stat.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...
This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.
This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
to export our internal do_mmap_pgoff() function.
Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead. We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It does the same thing as "do_brk()", except it handles the VM locking
too.
It turns out that all external callers want that anyway, so we can make
do_brk() static to just mm/mmap.c while at it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.
This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."
Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
/5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
NypQthI85pc=
=G9mT
-----END PGP SIGNATURE-----
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.
asm/system.h has been disintegrated by this point on all arches into the
following common segments:
(1) asm/barrier.h
Moved memory barrier definitions here.
(2) asm/cmpxchg.h
Moved xchg() and cmpxchg() here. #included in asm/atomic.h.
(3) asm/bug.h
Moved die() and similar here.
(4) asm/exec.h
Moved arch_align_stack() here.
(5) asm/elf.h
Moved AT_VECTOR_SIZE_ARCH here.
(6) asm/switch_to.h
Moved switch_to() here.
Signed-off-by: David Howells <dhowells@redhat.com>
Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.
The specific application I have in mind is qemu: we can add a flag there
that wouldn't dump all of guest memory when qemu dumps core. This flag
might also be useful for security sensitive apps that want to absolutely
make sure that parts of memory are not dumped. To clear the flag use:
MADV_DODUMP.
[akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
[akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The motivation for this patchset was that I was looking at a way for a
qemu-kvm process, to exclude the guest memory from its core dump, which
can be quite large. There are already a number of filter flags in
/proc/<pid>/coredump_filter, however, these allow one to specify 'types'
of kernel memory, not specific address ranges (which is needed in this
case).
Since there are no more vma flags available, the first patch eliminates
the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
the kernel to mark vdso and vsyscall pages. However, it is simple
enough to check if a vma covers a vdso or vsyscall page without the need
for this flag.
The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
continue to work the same as before unless 'MADV_DONTDUMP' is set on the
region.
The qemu code which implements this features is at:
http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
In my testing the qemu core dump shrunk from 383MB -> 13MB with this
patch.
I also believe that the 'MADV_DONTDUMP' flag might be useful for
security sensitive apps, which might want to select which areas are
dumped.
This patch:
The VM_ALWAYSDUMP flag is currently used by the coredump code to
indicate that a vma is part of a vsyscall or vdso section. However, we
can determine if a vma is in one these sections by checking it against
the gate_vma and checking for a non-NULL return value from
arch_vma_name(). Thus, freeing a valuable vma bit.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The regset common infrastructure assumed that regsets would always
have .get and .set methods, but not necessarily .active methods.
Unfortunately people have since written regsets without .set methods.
Rather than putting in stub functions everywhere, handle regsets with
null .get or .set methods explicitly.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow some core dump-related fields to be overridden. This allows
core dumps to work correctly for x32.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Randomization of PIE load address is hard coded in binfmt_elf.c for X86
and ARM. Create a new Kconfig variable
(CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead. Thus
architecture specific policy is pushed out of the generic binfmt_elf.c and
into the architecture Kconfig files.
X86 and ARM Kconfigs are modified to select the new variable so there is
no change in behavior. A follow on patch will select it for MIPS too.
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().
Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.
Based on original patch by H.J. Lu and Josh Boyer.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
tried to get the whole logic of brk randomization for legacy
(libc5-based) applications finally right.
It turns out that the way to detect whether brk has actually been
randomized in the end or not introduced by that patch still doesn't work
for those binaries, as reported by Geert:
: /sbin/init from my old m68k ramdisk exists prematurely.
:
: Before the patch:
:
: | brk(0x80005c8e) = 0x80006000
:
: After the patch:
:
: | brk(0x80005c8e) = 0x80005c8e
:
: Old libc5 considers brk() to have failed if the return value is not
: identical to the requested value.
I don't like it, but currently see no better option than a bit flag in
task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
case.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
deal with races in /proc/*/{syscall,stack,personality}
proc: enable writing to /proc/pid/mem
proc: make check_mem_permission() return an mm_struct on success
proc: hold cred_guard_mutex in check_mem_permission()
proc: disable mem_write after exec
mm: implement access_remote_vm
mm: factor out main logic of access_process_vm
mm: use mm_struct to resolve gate vma's in __get_user_pages
mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
mm: arch: make in_gate_area take an mm_struct instead of a task_struct
mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
x86: mark associated mm when running a task in 32 bit compatibility mode
x86: add context tag to mark mm when running a task in 32-bit compatibility mode
auxv: require the target to be tracable (or yourself)
close race in /proc/*/environ
report errors in /proc/*/*map* sanely
pagemap: close races with suid execve
make sessionid permissions in /proc/*/task/* match those in /proc/*
fix leaks in path_lookupat()
Fix up trivial conflicts in fs/proc/base.c
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task. Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With GCC-4.6 we get warnings about things being 'set but not used'.
In load_elf_binary() this can happen with reloc_func_desc if ELF_PLAT_INIT
is defined, but doesn't use the reloc_func_desc argument.
Quiet the warning/error by marking reloc_func_desc as __maybe_unused.
Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cleans up a few bits in binfmt_elf.c and binfmts.h:
- the hasvdso field in struct linux_binfmt is unused, so remove it and
the only initialization of it
- the elf_map CPP symbol is not defined anywhere in the kernel, so
remove an unnecessary #ifndef elf_map
- reduce excessive indentation in elf_format's initializer
- add missing spaces, remove extraneous spaces
No functional changes, but tested on x86 (32 and 64 bit), powerpc (32 and
64 bit), sparc64, arm, and alpha.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commits 990cb8acf2 and cc92c28b2d, it is possible to have full
address space layout randomization (ASLR) on ARM. Except that one small
change was missing for ASLR of PIE executables.
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Pass mm->flags as a coredump parameter for consistency.
---
1787 if (mm->core_state || !get_dumpable(mm)) { <- (1)
1788 up_write(&mm->mmap_sem);
1789 put_cred(cred);
1790 goto fail;
1791 }
1792
[...]
1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode */ <-(2)
1799 flag = O_EXCL; /* Stop rewrite attacks */
1800 cred->fsuid = 0; /* Dump root private */
1801 }
---
Since dumpable bits are not protected by lock, there is a chance to change
these bits between (1) and (2).
To solve this issue, this patch copies mm->flags to
coredump_params.mm_flags at the beginning of do_coredump() and uses it
instead of get_dumpable() while dumping core.
This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
dump_filter bits in mm->flags.
[akpm@linux-foundation.org: fix merge]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current ELF dumper implementation can produce broken corefiles if
program headers exceed 65535. This number is determined by the number of
vmas which the process have. In particular, some extreme programs may use
more than 65535 vmas. (If you google max_map_count, you can find some
users facing this problem.) This kind of program never be able to generate
correct coredumps.
This patch implements ``extended numbering'' that uses sh_info field of
the first section header instead of e_phnum field in order to represent
upto 4294967295 vmas.
This is supported by
AMD64-ABI(http://www.x86-64.org/documentation.html) and
Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
Of course, we are preparing patches for gdb and binutils.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By the next patch, elf_core_dump() and elf_fdpic_core_dump() will support
extended numbering and so will produce the corefiles with section header
table in a special case.
The problem is the process of writing a file header offset of the section
header table into e_shoff field of the ELF header. ELF header is
positioned at the beginning of the corefile, while section header at the
end. So, we need to take which of the following ways:
1. Seek backward to retry writing operation for ELF header
after writing process for a whole part
2. Make offset calculation process and writing process
totally sequential
The clause 1. is not always possible: one cannot assume that file system
supports seek function. Consider the no_llseek case.
Therefore, this patch adopts the clause 2.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
macro for hiding _multiline_ logics in functions. This patch removes
#ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For
architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
order to reduce a range of modification.
This cleanup is for my next patches, but I think this cleanup itself is
worth doing regardless of my firnal purpose.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My next patch will replace ELF_CORE_EXTRA_* macros by functions, putting
them into other newly created *.c files. Then, each files will contain
dump_write(), where each pair of binfmt_*.c and elfcore.c should be the
same. So, this patch moves them into a header file with dump_seek().
Also, the patch deletes confusing DUMP_WRITE macros in each files.
Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed. It doesn't just flush the old executable
environment, it also starts up the new one.
Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.
As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.
This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()). All callers are changed
to trivially comply with the new world order.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently all architectures but microblaze unconditionally define
USE_ELF_CORE_DUMP. The microblaze omission seems like an error to me, so
let's kill this ifdef and make sure we are the same everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Simek <michal.simek@petalogix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a helper function elf_note_info_init() to help fill_note_info()
to do initializations, also fix the potential memory leaks.
[akpm@linux-foundation.org: remove NUM_NOTES]
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for the next patch, add a simple get_dump_page(addr)
interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
get_user_pages() directly. They're not interested in errors: they
just want to use holes as much as possible, to save space and make
sure that the data is aligned where the headers said it would be.
Oh, and don't use that horrid DUMP_SEEK(off) macro!
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.
Here is a small test case. (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)
----- ptinterp.S
_start: .globl _start
nop
int3
-----
$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
$ ./hello
Segmentation fault # during execve() itself
After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes
If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.
This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.
Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check before use it.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With ELF, at generating coredump, some more headers other than used
vmas are added.
When max_map_count == 65536, a core generated by following kinds of
code can be unreadable because the number of ELF's program header is
written in 16bit in Ehdr (please see elf.h) and the number overflows.
==
... = mmap(); (munmap, mprotect, etc...)
if (failed)
abort();
==
This can happen in mmap/munmap/mprotect/etc...which calls split_vma().
I think 65536 is not safe as _default_ and reduce it to 65530 is good
for avoiding unexpected corrupted core.
Anyway, max_map_count can be enlarged by sysctl if a user is brave..
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In theory it is not safe to dereference ->parent/real_parent without
tasklist or rcu lock, we can race with re-parenting.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... since we don't tell anyone which descriptor does the file get.
We used to, but only in case of ELF binary with a.out loader and
that stuff has been gone for a while.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.
Checking for VM_READ optimizes out the case where get_user() would fail
anyway. The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled. This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.
[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html
Ulrich said:
glibc needs right after startup a bit of random data for internal
protections (stack canary etc). What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it. For
every process startup. That's slow.
...
The solution is to provide a limited amount of random data to the
starting process in the aux vector. I suggested 16 bytes and this is
what the patch implements. If we need only 16 bytes or less we use the
data directly. If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose. It might not
be the same pool as that for /dev/urandom.
Concerns were expressed about the depletion of the randomness pool. But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch_setup_additional_pages currently gets two arguments, the binary
format descripton and an indication if the process uses an executable
stack or not. The second argument is not used by anybody, it could
be removed without replacement.
What actually does make sense is to pass an indication if the process
uses the elf interpreter or not. The glibc code will not use anything
from the vdso if the process does not use the dynamic linker, so for
statically linked binaries the architecture backend can choose not
to map the vdso.
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap current->cred and a few other accessors to hide their actual
implementation.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.
Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.
With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
* 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
fix documentation of sysrq-q really
Fix documentation of sysrq-q
timer_list: add base address to clock base
timer_list: print cpu number of clockevents device
timer_list: print real timer address
NOHZ: restart tick device from irq_enter()
NOHZ: split tick_nohz_restart_sched_tick()
NOHZ: unify the nohz function calls in irq_enter()
timers: fix itimer/many thread hang, fix
timers: fix itimer/many thread hang, v3
ntp: improve adjtimex frequency rounding
timekeeping: fix rounding problem during clock update
ntp: let update_persistent_clock() sleep
hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds
posix-timers: lock_timer: make it readable
posix-timers: lock_timer: kill the bogus ->it_id check
posix-timers: kill ->it_sigev_signo and ->it_sigev_value
posix-timers: sys_timer_create: cleanup the error handling
posix-timers: move the initialization of timer->sigq from send to create path
posix-timers: sys_timer_create: simplify and s/tasklist/rcu/
...
Fix trivial conflicts due to sysrq-q description clahes in
Documentation/sysrq.txt and drivers/char/sysrq.c
Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped. But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g. vmalloc, sound related).
Thus hugepages are never dumped and it can't be debugged easily. Many
developers want hugepages to be included into core-dump.
However, We can't read generic VM_RESERVED area because this area is often
IO mapping area. then these area reading may change device state. it is
definitly undesiable side-effect.
So adding a hugepage specific bit to the coredump filter is better. It
will be able to hugepage core dumping and doesn't cause any side-effect to
any i/o devices.
In additional, libhugetlb use hugetlb private mapping pages as anonymous
page. Then, hugepage private mapping pages should be core dumped by
default.
Then, /proc/[pid]/core_dump_filter has two new bits.
- bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes)
- bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no)
I tested by following method.
% ulimit -c unlimited
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage 50
% ./crash_hugepage 50 -p
% ls -lh
% gdb ./crash_hugepage core
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include "hugetlbfs.h"
int main(int argc, char** argv){
char* p;
int ch;
int mmap_flags = MAP_SHARED;
int fd;
int nr_pages;
while((ch = getopt(argc, argv, "p")) != -1) {
switch (ch) {
case 'p':
mmap_flags &= ~MAP_SHARED;
mmap_flags |= MAP_PRIVATE;
break;
default:
/* nothing*/
break;
}
}
argc -= optind;
argv += optind;
if (argc == 0){
printf("need # of pages\n");
exit(1);
}
nr_pages = atoi(argv[0]);
if (nr_pages < 2) {
printf("nr_pages must >2\n");
exit(1);
}
fd = hugetlbfs_unlinked_fd();
p = mmap(NULL, nr_pages * gethugepagesize(),
PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
sleep(2);
*(p + gethugepagesize()) = 1; /* COW */
sleep(2);
/* crash! */
*(int*)0 = 1;
return 0;
}
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SET_PERSONALITY macro is always called with a second argument of 0.
Remove the ibcs argument and the various tests to set the PER_SVR4
personality.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Overview
This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.
The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.
This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
Code Changes
This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.
We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:
struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};
This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:
struct thread_group_cputime {
struct task_cputime totals;
};
struct thread_group_cputime {
struct task_cputime *totals;
};
We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers). The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends. In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention). For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu(). The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().
We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel. The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields. The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures. The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated. The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU. Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.
Non-SMP operation is trivial and will not be mentioned further.
The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().
All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.
Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away. All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline. When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.
Performance
The fix appears not to add significant overhead to existing operations. It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below). Overall it's a wash except in those
two cases.
I've since done somewhat more involved testing on a dual-core Opteron system.
Case 1: With no itimer running, for a test with 100,000 threads, the fixed
kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
all of which was spent in the system. There were twice as many
voluntary context switches with the fix as without it.
Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
an unmodified kernel can handle), the fixed kernel ran the test in
eight percent of the time (5.8 seconds as opposed to 70 seconds) and
had better tick accuracy (.012 seconds per tick as opposed to .023
seconds per tick).
Case 3: A 4000-thread test with an initial timer tick of .01 second and an
interval of 10,000 seconds (i.e. a timer that ticks only once) had
very nearly the same performance in both cases: 6.3 seconds elapsed
for the fixed kernel versus 5.5 seconds for the unfixed kernel.
With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.
Since the fix affected the rlimit code, I also tested soft and hard CPU limits.
Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
running), the modified kernel was very slightly favored in that while
it killed the process in 19.997 seconds of CPU time (5.002 seconds of
wall time), only .003 seconds of that was system time, the rest was
user time. The unmodified kernel killed the process in 20.001 seconds
of CPU (5.014 seconds of wall time) of which .016 seconds was system
time. Really, though, the results were too close to call. The results
were essentially the same with no itimer running.
Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
(where the hard limit would never be reached) and an itimer running,
the modified kernel exhibited worse tick accuracy than the unmodified
kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
performance was almost indistinguishable. With no itimer running this
test exhibited virtually identical behavior and times in both cases.
In times past I did some limited performance testing. those results are below.
On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds. Performance with eight, four and one
thread were comparable. Interestingly, the timer ticks with the fix seemed
more accurate: The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick. Both cases were configured for an interval of
0.01 seconds. Again, the other tests were comparable. Each thread in this
test computed the primes up to 25,000,000.
I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix. In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable). System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
accurate. There is obviously no comparable test without the fix.
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This moves all the ptrace hooks related to exec into tracehook.h inlines.
This also lifts the calls for tracing out of the binfmt load_binary hooks
into search_binary_handler() after it calls into the binfmt module. This
change has no effect, since all the binfmt modules' load_binary functions
did the call at the end on success, and now search_binary_handler() does
it immediately after return if successful. We consolidate the repeated
code, and binfmt modules no longer need to import ptrace_notify().
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
powerpc: Wireup new syscalls
Move update_mmu_cache() declaration from tlbflush.h to pgtable.h
powerpc/pseries: Remove kmalloc call in handling writes to lparcfg
powerpc/pseries: Update arch vector to indicate support for CMO
ibmvfc: Add support for collaborative memory overcommit
ibmvscsi: driver enablement for CMO
ibmveth: enable driver for CMO
ibmveth: Automatically enable larger rx buffer pools for larger mtu
powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O
powerpc/pseries: vio bus support for CMO
powerpc/pseries: iommu enablement for CMO
powerpc/pseries: Add CMO paging statistics
powerpc/pseries: Add collaborative memory manager
powerpc/pseries: Utilities to set firmware page state
powerpc/pseries: Enable CMO feature during platform setup
powerpc/pseries: Split retrieval of processor entitlement data into a helper routine
powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg
powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines
powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg
powerpc: Fix compile error with binutils 2.15
...
Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually.
Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.
This patch allows futher cleanups in binfmt_elf.c, in particular we can
kill the parallel info->threads list.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
linux_binfmt->core_dump() runs before the process does exit_aio(), this
means that we can hit the kernel thread which shares the same ->mm.
Afaics, nothing really bad can happen, but perhaps it makes sense to fix
this minor bug.
It is sad we have to iterate over all threads in system and use
GFP_ATOMIC. Hopefully we can kill theses ugly do_each_thread()s, but this
needs some nontrivial changes in mm_struct and do_coredump.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware. For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+". This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.
However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model. A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware. For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6". The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.
If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Linux kernel puts the filename argument of execve() into the new
address space. Many developers are surprised to learn this. Those who
know and could use it, object "But it's not documented."
Those who want to use it dislike the expression
(char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.
This patch documents the insertion of the filename, and makes it easier
to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.
In many cases readlink("/proc/self/exe",) gives the same answer. But if
all the original pages get unmapped, then the kernel erases the symlink
for /proc/self/exe. This can happen when a program decompressor does a
good job of cleaning up after uncompressing directly to memory, so that
the address space of the target program looks the same as if compression
had never happened. One example is http://upx.sourceforge.net .
One notable use of the underlying concept (what path containED the
executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for
the near term, it may be a good idea for user-mode code to use both
/proc/self/exe and AT_EXECFN as fall-back methods for each other.
/proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val
also can get overwritten, although in nearly all cases this would be the
result of a bug.
The runtime cost is one NEW_AUX_ENT using two words of stack space. The
underlying value is maintained already as bprm->exec; setup_arg_pages()
in fs/exec.c slides it for stack_shift, etc.
Signed-off-by: John Reiser <jreiser@BitWagon.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit d20894a237 ("Remove a.out
interpreter support in ELF loader"), Andi removed support for a.out
interpreters from the ELF loader, which was only ever needed for the
transition from a.out to ELF.
This removes the last traces of that support, in particular the
inclusion of <linux/a.out.h>.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
create_elf_tables() returns 0 on success. But when strnlen_user() "fails",
it returns 0 directly. So this is wrong.
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In kmalloc failing path, we shouldn't free pointers in 'info',
because the struct 'info' is uninitilized when kmalloc is called.
And when kmalloc returns NULL, it's needless to kfree it.
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
--
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix these sparse warings:
fs/binfmt_elf.c:1749:29: warning: symbol 'tmp' shadows an earlier one
fs/binfmt_elf.c:1734:28: originally declared here
fs/binfmt_elf.c:2009:26: warning: symbol 'vma' shadows an earlier one
fs/binfmt_elf.c:1892:24: originally declared here
[akpm@linux-foundation.org: chose better variable name]
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch does simplify fill_elf_header function by setting
to zero the whole elf header first. So we fillup the fields
we really need only.
before:
text data bss dec hex filename
11735 80 0 11815 2e27 fs/binfmt_elf.o
after:
text data bss dec hex filename
11710 80 0 11790 2e0e fs/binfmt_elf.o
viola, 25 bytes of text is freed
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.
As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This makes the user_regset-based core dump code call user_regset writeback
hooks when available. This is necessary groundwork to allow IA64 to set
CORE_DUMP_USE_REGSET.
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following the deprecation schedule the a.out ELF interpreter support
is removed now with this patch. a.out ELF interpreters were an transition
feature for moving a.out systems to ELF, but they're unlikely to be still
needed. Pure a.out systems will still work of course. This allows to
simplify the hairy ELF loader.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.
Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
be permitted to go looking for A.OUT libraries to load in such a case. Not
only that, but under such conditions A.OUT core dumps are not produced either.
To make this work, this patch also does the following:
(1) Makes the existence of the contents of linux/a.out.h contingent on
CONFIG_ARCH_SUPPORTS_AOUT.
(2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
core dumping code.
(3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
is then included only where needed. This means that this bit of arch
code will be stored in the appropriate A.OUT binfmt module rather than
the core kernel.
(4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
needed) and FRV.
This patch depends on the previous patch to move STACK_TOP[_MAX] out of
asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
format is available.
[jdike@addtoit.com: uml: re-remove accidentally restored code]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
based on similar patch from: Pavel Machek <pavel@ucw.cz>
Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free
(but not obliged to) randomize the brk area.
Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
enabled by default.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ibcs2 support has never been supported on 2.6 kernels as far as I know,
and if it has it must have been an external patch. Anyways, if anybody
applies an external patch they could as well readd the ibcs checking
code to the ELF loader in the same patch. But there is no reason to
keep this code running in all Linux kernels. This will save at least
two strcmps each ELF execution.
No deprecation period because it could not have been used anyway.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This modifies the ELF core dump code under #ifdef CORE_DUMP_USE_REGSET.
It changes nothing when this macro is not defined. When it's #define'd
by some arch header (e.g. asm/elf.h), the arch must support the
user_regset (linux/regset.h) interface for reading thread state.
This provides an alternate version of note segment writing that is based
purely on the user_regset interfaces. When CORE_DUMP_USE_REGSET is set,
the arch need not define macros such as ELF_CORE_COPY_REGS and ELF_ARCH.
All that information is taken from the user_regset data structures.
The core dumps come out exactly the same if arch's definitions for its
user_regset details are correct.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This pulls out the code for writing the notes segment of an ELF core dump
into separate functions. This cleanly isolates into one cluster of
functions everything that deals with the note formats and the hooks into
arch code to fill them. The top-level elf_core_dump function itself now
deals purely with the generic ELF format and the memory segments.
This only moves code around into functions that can be inlined away.
It should not change any behavior at all.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
#39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
+elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)
WARNING: no space between function name and open parenthesis '('
#39: FILE: arch/ia64/ia32/binfmt_elf32.c:229:
+elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type, unsigned long unused)
WARNING: line over 80 characters
#67: FILE: arch/x86/kernel/sys_x86_64.c:80:
+ new_begin = randomize_range(*begin, *begin + 0x02000000, 0);
ERROR: use tabs not spaces
#110: FILE: arch/x86/kernel/sys_x86_64.c:185:
+ ^I mm->cached_hole_size = 0;$
ERROR: use tabs not spaces
#111: FILE: arch/x86/kernel/sys_x86_64.c:186:
+ ^I^Imm->free_area_cache = mm->mmap_base;$
ERROR: use tabs not spaces
#112: FILE: arch/x86/kernel/sys_x86_64.c:187:
+ ^I}$
ERROR: use tabs not spaces
#141: FILE: arch/x86/kernel/sys_x86_64.c:216:
+ ^I^I/* remember the largest hole we saw so far */$
ERROR: use tabs not spaces
#142: FILE: arch/x86/kernel/sys_x86_64.c:217:
+ ^I^Iif (addr + mm->cached_hole_size < vma->vm_start)$
ERROR: use tabs not spaces
#143: FILE: arch/x86/kernel/sys_x86_64.c:218:
+ ^I^I mm->cached_hole_size = vma->vm_start - addr;$
ERROR: use tabs not spaces
#157: FILE: arch/x86/kernel/sys_x86_64.c:232:
+ ^Imm->free_area_cache = TASK_UNMAPPED_BASE;$
ERROR: need a space before the open parenthesis '('
#291: FILE: arch/x86/mm/mmap_64.c:101:
+ } else if(mmap_is_legacy()) {
WARNING: braces {} are not necessary for single statement blocks
#302: FILE: arch/x86/mm/mmap_64.c:112:
+ if (current->flags & PF_RANDOMIZE) {
+ mm->mmap_base += ((long)rnd) << PAGE_SHIFT;
+ }
WARNING: line over 80 characters
#314: FILE: fs/binfmt_elf.c:48:
+static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);
WARNING: no space between function name and open parenthesis '('
#314: FILE: fs/binfmt_elf.c:48:
+static unsigned long elf_map (struct file *, unsigned long, struct elf_phdr *, int, int, unsigned long);
WARNING: line over 80 characters
#429: FILE: fs/binfmt_elf.c:438:
+ eppnt, elf_prot, elf_type, total_size);
ERROR: need space after that ',' (ctx:VxV)
#480: FILE: fs/binfmt_elf.c:939:
+ elf_prot, elf_flags,0);
^
total: 9 errors, 7 warnings, 461 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
main executable of (specially compiled/linked -pie/-fpie) ET_DYN binaries
onto a random address (in cases in which mmap() is allowed to perform a
randomization).
The code has been extraced from Ingo's exec-shield patch
http://people.redhat.com/mingo/exec-shield/
[akpm@linux-foundation.org: fix used-uninitialsied warning]
[kamezawa.hiroyu@jp.fujitsu.com: fixed ia32 ELF on x86_64 handling]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Randomize the location of the heap (brk) for i386 and x86_64. The range is
randomized in the range starting at current brk location up to 0x02000000
offset for both architectures. This, together with
pie-executable-randomization.patch and
pie-executable-randomization-fix.patch, should make the address space
randomization on i386 and x86_64 complete.
Arjan says:
This is known to break older versions of some emacs variants, whose dumper
code assumed that the last variable declared in the program is equal to the
start of the dynamically allocated memory region.
(The dumper is the code where emacs effectively dumps core at the end of it's
compilation stage; this coredump is then loaded as the main program during
normal use)
iirc this was 5 years or so; we found this way back when I was at RH and we
first did the security stuff there (including this brk randomization). It
wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
that emacs already had code to deal with it for other archs/oses, just
ifdeffed wrongly).
It's a rare and wrong assumption as a general thing, just on x86 it mostly
happened to be true (but to be honest, it'll break too if gcc does
something fancy or if the linker does a non-standard order). Still its
something we should at least document.
Note 2: afaik it only broke the emacs *build*. I'm not 100% sure about that
(it IS 5 years ago) though.
[ akpm@linux-foundation.org: deuglification ]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The pr_ppid field reported in core dumps should match what
getppid() would have returned to that process, regardless of
whether a debugger is attached.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.
The proposals are to
* equip the functions that return the integer with _nr suffix to
represent that fact,
* and to make all functions work with task (not process) by making
the common prefix of the same name.
For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently arch_align_stack() is used by fs/binfmt_elf.c to randomize
stack pointer inside a page. But this happens only if ELF_PLATFORM
symbol is defined.
ELF_PLATFORM is normally set if the architecture wants ld.so to load
implementation specific libraries for optimization. And currently a
lot of architectures just yield this symbol to NULL.
This is the case for MIPS architecture where ELF_PLATFORM is NULL but
arch_align_stack() has been redefined to do stack inside page
randomization. So in this case no randomization is actually done.
This patch breaks this dependency which seems to be useless and allows
platforms such MIPS to do the randomization.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>