Commit Graph

144 Commits

Author SHA1 Message Date
Sukadev Bhattiprolu 3fbc964864 [PATCH] Define struct pspace
Define a per-container pid space object.  And create one instance of this
object, init_pspace, to define the entire pid space.  Subsequent patches
will provide/use interfaces to create/destroy pid spaces.

Its a subset/rework of Eric Biederman's patch
http://lkml.org/lkml/2006/2/6/285 .

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Eric W. Biederman f6c7a1f34e [PATCH] proc: give the root directory a task
Helper functions in base.c like proc_pident_readdir and proc_pident_lookup
assume the directories have an associated task, and cannot currently be used
on the /proc root directory because it does not have such a task.

This small changes allows for base.c to be simplified and later when multiple
pid spaces are introduced it makes getting the needed context information
trivial.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman 20cdc894c4 [PATCH] proc: modify proc_pident_lookup to be completely table driven
Currently proc_pident_lookup gets the names and types from a table and then
has a huge switch statement to get the inode and file operations it needs.
That is silly and is becoming increasingly hard to maintain so I just put all
of the information in the table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman 28a6d67179 [PATCH] proc: reorder the functions in base.c
There were enough changes in my last round of cleaning up proc I had to break
up the patch series into smaller chunks, and my last chunk never got resent.

This patchset gives proc dynamic inode numbers (the static inode numbers were
a pain to maintain and prevent all kinds of things), and removes the horrible
switch statements that had to be kept in sync with everything else.  Being
fully table driver takes us 90% of the way of being able to register new
process specific attributes in proc.

This patch:

Group the functions by what they implement instead of by type of operation.
As it existed base.c was quickly approaching the point where it could not be
followed.

No functionality or code changes asside from adding/removing forward
declartions are implemented in this patch.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman 0804ef4b0d [PATCH] proc: readdir race fix (take 3)
The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of system
calls.  For this race to trigger the process that proc_pid_readdir stops at
must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation of
posix guarantees and normal application expectations with respect to
readdir.

Currently there is no way to work around this problem in user space short
of providing a gargantuan buffer to user space so the directory read all
happens in on system call.

This patch implements the normal directory semantics for proc, that
guarantee that a directory entry that is neither created nor destroyed
while reading the directory entry will be returned.  For directory that are
either created or destroyed during the readdir you may or may not see them.
 Furthermore you may seek to a directory offset you have previously seen.

These are the guarantee that ext[23] provides and that posix requires, and
more importantly that user space expects.  Plus it is a simple semantic to
implement reliable service.  It is just a matter of calling readdir a
second time if you are wondering if something new has show up.

These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm.  Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 200 pids.  There
are only 40 cache lines for the entire 32K pid space.  A typical system
will have 100 pids or more so this is actually fewer cache lines we have to
look at to scan a linked list, and the worst case of having to scan the
entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less code than
what we are doing now.

Also another very subtle bug in this area has been fixed.  It is possible
to catch a task in the middle of de_thread where a thread is assuming the
thread of it's thread group leader.  This patch carefully handles that case
so if we hit it we don't fail to return the pid, that is undergoing the
de_thread dance.

Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.

[oleg@tv-sign.ru: fix it]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:12 -07:00
David Howells 9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
Vivek Goyal 632dd2053a [PATCH] Kcore elf note namesz field fix
o As per ELF specifications, it looks like that elf note "namesz" field
  contains the length of "name" including the size of null character.  And
  currently we are filling "namesz" without taking into the consideration
  the null character size.

o Kexec-tools performs this check deligently hence I ran into the issue
  while trying to open /proc/kcore in kexec-tools for some info.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Frederik Deweerdt f7ca54f486 [PATCH] fix mem_write() return value
At the beginning of the routine, "copied" is set to 0, but it is no good
because in lines 805 and 812 it is set to other values.  Finally, the
routine returns as if it copied 12 (=ENOMEM) bytes less than it actually
did.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:19 -07:00
Alan Cox 3cfd0885fa [PATCH] tty: stop the tty vanishing under procfs access
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
David Howells f269fdd182 [PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c).

Currently it's in fs/proc/task_mmu.c, a file that is dependent on both
CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's used from
kernel/signal.c from where it is called unconditionally.

[akpm@osdl.org: build fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:15 -07:00
David Howells dbf8685c8e [PATCH] NOMMU: Implement /proc/pid/maps for NOMMU
Implement /proc/pid/maps for NOMMU by reading the vm_area_list attached to
current->mm->context.vmlist.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Panagiotis Issaris f8314dc60c [PATCH] fs: Conversions from kmalloc+memset to k(z|c)alloc
Conversions from kmalloc+memset to kzalloc.

Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Jffs2-bit-acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Christoph Lameter 972d1a7b14 [PATCH] ZVC: Support NR_SLAB_RECLAIMABLE / NR_SLAB_UNRECLAIMABLE
Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE.  The
intend seems to be to check for slab pages that could be freed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter 182e8e2373 [PATCH] reduce MAX_NR_ZONES: make display of highmem counters conditional on CONFIG_HIGHMEM
Do not display HIGHMEM memory sizes if CONFIG_HIGHMEM is not set.

Make HIGHMEM dependent texts and make display of highmem counters optional

Some texts are depending on CONFIG_HIGHMEM.

Remove those strings and remove the display of highmem counter values if
CONFIG_HIGHMEM is not set.

[akpm@osdl.org: remove some ifdefs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:46 -07:00
Andrew Morton f5ef68da5f [PATCH] /proc/meminfo: don't put spaces in names
None of the other /proc/meminfo lines have a space in the identifier.  This
post-2.6.17 addition has the potential to break existing parsers, so use an
underscore instead (like Committed_AS).

Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:33 -07:00
Linus Torvalds 6d76fa58b0 Don't allow chmod() on the /proc/<pid>/ files
This just turns off chmod() on the /proc/<pid>/ files, since there is no
good reason to allow it, and had we disallowed it originally, the nasty
/proc race exploit wouldn't have been possible.

The other patches already fixed the problem chmod() could cause, so this
is really just some final mop-up..

This particular version is based off a patch by Eugene and Marcel which
had much better naming than my original equivalent one.

Signed-off-by: Eugene Teo <eteo@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-15 12:26:45 -07:00
Linus Torvalds 92d032855e Mark /proc MS_NOSUID and MS_NOEXEC
Not that we really need this any more, but at the same time there's no
reason not to do this.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-15 12:20:05 -07:00
Shailabh Nagar 2589045466 [PATCH] per-task-delay-accounting: /proc export of aggregated block I/O delays
Export I/O delays seen by a task through /proc/<tgid>/stats for use in top
etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is clubbed
together with all other I/O here (this is not the case in the netlink
interface where the swapin I/O is kept distinct)

[akpm@osdl.org: printk warning fix]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Linus Torvalds 9ee8ab9fbf Relax /proc fix a bit
Clearign all of i_mode was a bit draconian. We only really care about
S_ISUID/ISGID, after all.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:48:03 -07:00
Linus Torvalds 18b0bbd8ca Fix nasty /proc vulnerability
We have a bad interaction with both the kernel and user space being able
to change some of the /proc file status.  This fixes the most obvious
part of it, but I expect we'll also make it harder for users to modify
even their "own" files in /proc.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 16:51:34 -07:00
Adam B. Jerome 0635170b54 [PATCH] /fs/proc/: 'larger than buffer size' memory accessed by clear_user()
Address a potential 'larger than buffer size' memory access by
clear_user().  Without this patch, this call to clear_user() can attempt to
clear too many (tsz) bytes resulting in a wrong (-EFAULT) return code by
read_kcore().

Signed-off-by: Adam B. Jerome <abj@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12 12:52:55 -07:00
David Howells b4cac1a022 [PATCH] FDPIC: Move roundup() into linux/kernel.h
Move the roundup() macro from binfmt_elf.c into linux/kernel.h as it's
generally useful.

[akpm@osdl.org: nuke all the other implementations]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Greg Ungerer 31304c909e [PATCH] uclinux: fix proc_task()/get_proc-task() naming
Fix changed name of proc_task() to get_proc_task().

Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 22:37:13 -07:00
Linus Torvalds 22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Christoph Lameter d2c5e30c9a [PATCH] zoned vm counters: conversion of nr_bounce to per zone counter
Conversion of nr_bounce to a per zone counter

nr_bounce is only used for proc output.  So it could be left as an event
counter.  However, the event counters may not be accurate and nr_bounce is
categorizing types of pages in a zone.  So we really need this to also be a
per zone counter.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter fd39fc8561 [PATCH] zoned vm counters: conversion of nr_unstable to per zone counter
Conversion of nr_unstable to a per zone counter

We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.

This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again.  We are only left with event type counters
in page state.

[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:36 -07:00
Christoph Lameter ce866b34ae [PATCH] zoned vm counters: conversion of nr_writeback to per zone counter
Conversion of nr_writeback to per zone counter.

This removes the last page_state counter from arch/i386/mm/pgtable.c so we
drop the page_state from there.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter b1e7a8fd85 [PATCH] zoned vm counters: conversion of nr_dirty to per zone counter
This makes nr_dirty a per zone counter.  Looping over all processors is
avoided during writeback state determination.

The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones.  Someone more familiar with
NFS should probably review what I have done.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter df849a1529 [PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter
Conversion of nr_page_table_pages to a per zone counter

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter 9a865ffa34 [PATCH] zoned vm counters: conversion of nr_slab to per zone counter
- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
  the slab. This may become useful to balance slab allocations over
  various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter f3dbd34460 [PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED
The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
calculation as the number of mapped pagecache pages.  However, that is not
true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
separates those and therefore allows an accurate tracking of the anonymous
pages per zone.

It then becomes possible to determine the number of unmapped pages per zone
and we can avoid scanning for unmapped pages if there are none.

Also it may now be possible to determine the mapped/unmapped ratio in
get_dirty_limit.  Isnt the number of anonymous pages irrelevant in that
calculation?

Note that this will change the meaning of the number of mapped pages reported
in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
user space tools that monitor these counters!  NR_FILE_MAPPED works like
NR_FILE_DIRTY.  It is only valid for pagecache pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Christoph Lameter 347ce434d5 [PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter
Currently a single atomic variable is used to establish the size of the page
cache in the whole machine.  The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.

Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Christoph Lameter 65ba55f500 [PATCH] zoned vm counters: convert nr_mapped to per zone counter
nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.

We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).

We replace the use of nr_mapped in various kernel locations.  This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:34 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Ingo Molnar e6e5494cb2 [PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.

Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.

It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).

There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO.  Newer
distributions (using glibc 2.3.3 or later) can turn this option off.  Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.

There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.

(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)

This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and i completed it.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revernt MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
Eric Paris 42c3e03ef6 [PATCH] SELinux: Add sockcreate node to procattr API
Below is a patch to add a new /proc/self/attr/sockcreate A process may write a
context into this interface and all subsequent sockets created will be labeled
with that context.  This is the same idea as the fscreate interface where a
process can specify the label of a file about to be created.  At this time one
envisioned user of this will be xinetd.  It will be able to better label
sockets for the actual services.  At this time all sockets take the label of
the creating process, so all xinitd sockets would just be labeled the same.

I tested this by creating a tcp sender and listener.  The sender was able to
write to this new proc file and then create sockets with the specified label.
I am able to be sure the new label was used since the avc denial messages
kicked out by the kernel included both the new security permission
setsockcreate and all the socket denials were for the new label, not the label
of the running process.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Oleg Nesterov c1df7fb88a [PATCH] cleanup next_tid()
Try to make next_tid() a bit more readable and deletes unnecessary
"pid_alive(pos)" check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Oleg Nesterov a872ff0cb2 [PATCH] simplify/fix first_tid()
first_tid:

	/* If nr exceeds the number of threads there is nothing todo */
	if (nr) {
		if (nr >= get_nr_threads(leader))
			goto done;
	}

This is not reliable: sub-threads can exit after this check, so the
'for' loop below can overlap and proc_task_readdir() can return an
already filldir'ed dirents.

	for (; pos && pid_alive(pos); pos = next_thread(pos)) {
		if (--nr > 0)
			continue;

Off-by-one error, will return 'leader' when nr == 1.

This patch tries to fix these problems and simplify the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman cc288738c9 [PATCH] proc: Remove tasklist_lock from proc_task_readdir.
This is just like my previous removal of tasklist_lock from first_tgid, and
next_tgid.  It simply had to wait until it was rcu safe to walk the thread
list.

This should be the last instance of the tasklist_lock in proc.  So user
processes should not be able to influence the tasklist lock hold times.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman df26c40e56 [PATCH] proc: Cleanup proc_fd_access_allowed
In process of getting proc_fd_access_allowed to work it has developed a few
warts.  In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.

The special case for introspection is needed for /proc/self/mem.

The special case for kernel threads really should be overridable
by security modules.

So consolidate these checks into ptrace.c:may_attach().

The check to always allow introspection is trivial.

The check to allow access to kernel threads, and zombies is a little
trickier.  mem_read and mem_write already verify an mm exists so it isn't
needed twice.  proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads.  So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.

I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace.  So the
above move should be safe and it allows security modules to come up with
more restrictive policy.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 778c114477 [PATCH] proc: Use sane permission checks on the /proc/<pid>/fd/ symlinks
Since 2.2 we have been doing a chroot check to see if it is appropriate to
return a read or follow one of these magic symlinks.  The chroot check was
asking a question about the visibility of files to the calling process and
it was actually checking the destination process, and not the files
themselves.  That test was clearly bogus.

In my first pass through I simply fixed the test to check the visibility of
the files themselves.  That naive approach to fixing the permissions was
too strict and resulted in cases where a task could not even see all of
it's file descriptors.

What has disturbed me about relaxing this check is that file descriptors
are per-process private things, and they are occasionaly used a user space
capability tokens.  Looking a little farther into the symlink path on /proc
I did find userid checks and a check for capability (CAP_DAC_OVERRIDE) so
there were permissions checking this.

But I was still concerned about privacy.  Besides /proc there is only one
other way to find out this kind of information, and that is ptrace.  ptrace
has been around for a long time and it has a well established security
model.

So after thinking about it I finally realized that the permission checks
that make sense are the permission checks applied to ptrace_attach.  The
checks are simple per process, and won't cause nasty surprises for people
coming from less capable unices.

Unfortunately there is one case that the current ptrace_attach test does
not cover: Zombies and kernel threads.  Single stepping those kinds of
processes is impossible.  Being able to see which file descriptors are open
on these tasks is important to lsof, fuser and friends.  So for these
special processes I made the rule you can't find out unless you have
CAP_SYS_PTRACE.

These proc permission checks should now conform to the principle of least
surprise.  As well as using much less code to implement :)

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 5b0c1dd38b [PATCH] proc: optimize proc_check_dentry_visible
The code doesn't need to sleep to when making this check so I can just do the
comparison and not worry about the reference counts.

TODO: While looking at this I realized that my original cleanup did not push
the permission check far enough down into the stack.  The call of
proc_check_dentry_visible needs to move out of the generic proc
readlink/follow link code and into the individual get_link instances.
Otherwise the shared resources checks are not quite correct (shared
files_struct does not require a shared fs_struct), and there are races with
unshare.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 13b41b0949 [PATCH] proc: Use struct pid not struct task_ref
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.

Mostly this is a straight 1-1 substitution.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman 99f8955183 [PATCH] proc: don't lock task_structs indefinitely
Every inode in /proc holds a reference to a struct task_struct.  If a
directory or file is opened and remains open after the the task exits this
pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about 100
processes.  With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.

This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer.  The so the pinning of dead
tasks does not happen.

The code now has to contend with the fact that the task may now exit at any
time.  Which is a little but not muh more complicated.

With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer.  Much better.

[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman 8578cea750 [PATCH] proc: make PROC_NUMBUF the buffer size for holding integers as strings
Currently in /proc at several different places we define buffers to hold a
process id, or a file descriptor .  In most of them we use either a hard coded
number or a different define.  Modify them all to use PROC_NUMBUF, so the code
has a chance of being maintained.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman 9cc8cbc7f8 [PATCH] simply fix first_tgid
Like the bug Oleg spotted in first_tid there was also a small off by one
error in first_tgid, when a seek was done on the /proc directory.  This
fixes that and changes the code structure to make it a little more obvious
what is going on.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman de7587343b [PATCH] proc: Remove tasklist_lock from proc_pid_lookup() and proc_task_lookup()
Since we no longer need the tasklist_lock for get_task_struct the lookup
methods no longer need the tasklist_lock.

This just depends on my previous patch that makes get_task_struct() rcu
safe.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman 454cc105ef [PATCH] proc: Remove tasklist_lock from proc_pid_readdir
We don't need the tasklist_lock to safely iterate through processes
anymore.

This depends on my previous to task patches that make get_task_struct rcu
safe, and that make next_task() rcu safe.  I haven't gotten
first_tid/next_tid yet only because next_thread is missing an
rcu_dereference.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman 0bc58a9102 [PATCH] proc: refactor reading directories of tasks
There are a couple of problems this patch addresses.
- /proc/<tgid>/task currently does not work correctly if you stop reading
  in the middle of a directory.

- /proc/ currently requires a full pass through the task list with
  the tasklist lock held, to determine there are no more processes to read.

- The hand rolled integer to string conversion does not properly running
  out of buffer space.

- We seem to be batching reading of pids from the tasklist without reason,
  and complicating the logic of the code.

This patch addresses that by changing how tasks are processed.  A
first_<task_type> function is built that handles restarts, and a
next_<task_type> function is built that just advances to the next task.

first_<task_type> when it detects a restart usually uses find_task_by_pid.  If
that doesn't work because there has been a seek on the directory, or we have
already given a complete directory listing, it first checks the number tasks
of that type, and only if we are under that count does it walk through all of
the tasks to find the one we are interested in.

The code that fills in the directory is simpler because there is only a single
for loop.

The hand rolled integer to string conversion is replaced by snprintf which
should handle the the out of buffer case correctly.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman cd6a3ce9ec [PATCH] proc: Close the race of a process dying durning lookup
proc_lookup and task exiting are not synchronized, although some of the
previous code may have suggested that.  Every time before we reuse a dentry
namei.c calls d_op->derevalidate which prevents us from reusing a stale dcache
entry.  Unfortunately it does not prevent us from returning a stale dcache
entry.  This race has been explicitly plugged in proc_pid_lookup but there is
nothing to confine it to just that proc lookup function.

So to prevent the race I call revalidate explictily in all of the proc lookup
functions after I call d_add, and report an error if the revalidate does not
succeed.

Years ago Al Viro did something similar but those changes got lost in the
churn.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00